Computation and Language
☆ UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models
Large Language Models (LLMs) are vulnerable to attacks like prompt injection,
backdoor attacks, and adversarial attacks, which manipulate prompts or models
to generate harmful outputs. In this paper, departing from traditional
deep-learning attack paradigms, we explore the intrinsic relationship among
these attacks and collectively term them Prompt Trigger Attacks (PTA). This raises a key
question: Can we determine if a prompt is benign or poisoned? To address this,
we propose UniGuardian, the first unified defense mechanism designed to detect
prompt injection, backdoor attacks, and adversarial attacks in LLMs.
Additionally, we introduce a single-forward strategy to optimize the detection
pipeline, enabling simultaneous attack detection and text generation within a
single forward pass. Our experiments confirm that UniGuardian accurately and
efficiently identifies malicious prompts in LLMs.
comment: 18 Pages, 8 Figures, 5 Tables, Keywords: Attack Defending, Security,
Prompt Injection, Backdoor Attacks, Adversarial Attacks, Prompt Trigger
Attacks
☆ Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions
Taedong Yun, Eric Yang, Mustafa Safdari, Jong Ha Lee, Vaishnavi Vinod Kumar, S. Sara Mahdavi, Jonathan Amar, Derek Peyton, Reut Aharony, Andreas Michaelides, Logan Schneider, Isaac Galatzer-Levy, Yugang Jia, John Canny, Arthur Gretton, Maja Matarić
We present an end-to-end framework for generating synthetic users for
evaluating interactive agents designed to encourage positive behavior changes,
such as in health and lifestyle coaching. The synthetic users are grounded in
health and lifestyle conditions, specifically sleep and diabetes management in
this study, to ensure realistic interactions with the health coaching agent.
Synthetic users are created in two stages: first, structured data are generated
grounded in real-world health and lifestyle factors in addition to basic
demographics and behavioral attributes; second, full profiles of the synthetic
users are developed conditioned on the structured data. Interactions between
synthetic users and the coaching agent are simulated using generative
agent-based models such as Concordia, or directly by prompting a language
model. Using two independently-developed agents for sleep and diabetes coaching
as case studies, the validity of this framework is demonstrated by analyzing
the coaching agent's understanding of the synthetic users' needs and
challenges. Finally, through multiple blinded evaluations of user-coach
interactions by human experts, we demonstrate that our synthetic users with
health and behavioral attributes more accurately portray real human users with
the same attributes, compared to generic synthetic users not grounded in such
attributes. The proposed framework lays the foundation for efficient
development of conversational agents through extensive, realistic, and grounded
simulated interactions.
☆ Rethinking Diverse Human Preference Learning through Principal Component Analysis
Understanding human preferences is crucial for improving foundation models
and building personalized AI systems. However, preferences are inherently
diverse and complex, making it difficult for traditional reward models to
capture their full range. While fine-grained preference data can help,
collecting it is expensive and hard to scale. In this paper, we introduce
Decomposed Reward Models (DRMs), a novel approach that extracts diverse human
preferences from binary comparisons without requiring fine-grained annotations.
Our key insight is to represent human preferences as vectors and analyze them
using Principal Component Analysis (PCA). By constructing a dataset of
embedding differences between preferred and rejected responses, DRMs identify
orthogonal basis vectors that capture distinct aspects of preference. These
decomposed rewards can be flexibly combined to align with different user needs,
offering an interpretable and scalable alternative to traditional reward
models. We demonstrate that DRMs effectively extract meaningful preference
dimensions (e.g., helpfulness, safety, humor) and adapt to new users without
additional training. Our results highlight DRMs as a powerful framework for
personalized and interpretable LLM alignment.
comment: 14 pages
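The PCA step the abstract describes can be sketched in a few lines. The function names and the synthetic embeddings below are illustrative, not the paper's implementation: the idea is that the top principal axes of centered preferred-minus-rejected embedding differences form an orthonormal basis of "decomposed rewards" that can be reweighted per user.

```python
import numpy as np

def decompose_preferences(preferred_emb, rejected_emb, k=3):
    # PCA over embedding differences between preferred and rejected
    # responses: the top-k principal axes serve as orthogonal
    # "decomposed reward" directions.
    diffs = preferred_emb - rejected_emb      # (n_pairs, dim)
    diffs = diffs - diffs.mean(axis=0)        # center before PCA
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]                             # (k, dim) orthonormal rows

def combined_reward(basis, response_emb, user_weights):
    # decomposed rewards are combined linearly to match a user's needs
    return user_weights @ (basis @ response_emb)

rng = np.random.default_rng(0)
preferred = rng.normal(size=(200, 16))        # stand-in response embeddings
rejected = rng.normal(size=(200, 16))
basis = decompose_preferences(preferred, rejected, k=3)
```

Adapting to a new user then reduces to choosing the weight vector, which requires no gradient training of the reward model itself.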
☆ Facilitating Long Context Understanding via Supervised Chain-of-Thought Reasoning
Recent advances in Large Language Models (LLMs) have enabled them to process
increasingly longer sequences, ranging from 2K to 2M tokens and even beyond.
However, simply extending the input sequence length does not necessarily lead
to effective long-context understanding. In this study, we integrate
Chain-of-Thought (CoT) reasoning into LLMs in a supervised manner to facilitate
effective long-context understanding. To achieve this, we introduce
LongFinanceQA, a synthetic dataset in the financial domain designed to improve
long-context reasoning. Unlike existing long-context synthetic data,
LongFinanceQA includes intermediate CoT reasoning before the final conclusion,
which encourages LLMs to perform explicit reasoning, improving accuracy and
interpretability in long-context understanding. To generate synthetic CoT
reasoning, we propose Property-driven Agentic Inference (PAI), an agentic
framework that simulates human-like reasoning steps, including property
extraction, retrieval, and summarization. We evaluate PAI's reasoning
capabilities by assessing GPT-4o-mini with PAI on the Loong benchmark, where
it outperforms standard GPT-4o-mini by 20.0%. Furthermore, we fine-tune
LLaMA-3.1-8B-Instruct on LongFinanceQA, achieving a 24.6% gain on Loong's
financial subset.
comment: 15 Pages, 6 Tables, 8 Figures
☆ RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises
Recent advances in large language models (LLMs) have shown that they can
answer questions requiring complex reasoning. However, their ability to
identify and respond to text containing logical fallacies or deliberately
misleading premises remains less studied. To address this gap, we introduce
RuozhiBench, a bilingual dataset comprising 677 carefully curated questions
that contain various forms of deceptive reasoning, meticulously crafted through
extensive human effort and expert review. In a comprehensive evaluation of 17
LLMs from 5 series on RuozhiBench using both open-ended and two-choice
formats, we conduct extensive analyses on evaluation protocols and result
patterns. Despite their high scores on conventional benchmarks, these models
showed limited ability to detect and reason correctly about logical fallacies,
with even the best-performing model, Claude-3-haiku, achieving only 62%
accuracy, compared to human accuracy of more than 90%.
☆ NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions
Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Dong Wang, Ilia Kulikov, Kyunghyun Cho, Yuandong Tian, Jason E Weston, Xian Li
Scaling reasoning capabilities beyond traditional domains such as math and
coding is hindered by the lack of diverse and high-quality questions. To
overcome this limitation, we introduce a scalable approach for generating
diverse and challenging reasoning questions, accompanied by reference answers.
We present NaturalReasoning, a comprehensive dataset comprising 2.8 million
questions that span multiple domains, including STEM fields (e.g., Physics,
Computer Science), Economics, Social Sciences, and more. We demonstrate the
utility of the questions in NaturalReasoning through knowledge distillation
experiments which show that NaturalReasoning can effectively elicit and
transfer reasoning capabilities from a strong teacher model. Furthermore, we
demonstrate that NaturalReasoning is also effective for unsupervised
self-training using external reward models or self-rewarding.
comment: Dataset at https://huggingface.co/datasets/facebook/natural_reasoning
☆ Adapting Psycholinguistic Research for LLMs: Gender-inclusive Language in a Coreference Context ACL 2025
Gender-inclusive language is often used with the aim of ensuring that all
individuals, regardless of gender, can be associated with certain concepts.
While psycholinguistic studies have examined its effects in relation to human
cognition, it remains unclear how Large Language Models (LLMs) process
gender-inclusive language. Given that commercial LLMs are gaining an
increasingly strong foothold in everyday applications, it is crucial to examine
whether LLMs in fact interpret gender-inclusive language neutrally, because the
language they generate has the potential to influence the language of their
users. This study examines whether LLM-generated coreferent terms align with a
given gender expression or reflect model biases. Adapting psycholinguistic
methods from French to English and German, we find that in English, LLMs
generally maintain the antecedent's gender but exhibit underlying masculine
bias. In German, this bias is much stronger, overriding all tested
gender-neutralization strategies.
comment: 9 pages, 7 figures, submitted to ACL 2025 (ARR February 2025 cycle)
☆ STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models
How should one judge whether a given large language model (LLM) can reliably
perform economic reasoning? Most existing LLM benchmarks focus on specific
applications and fail to present the model with a rich variety of economic
tasks. A notable exception is Raman et al. [2024], who offer an approach for
comprehensively benchmarking strategic decision-making; however, this approach
fails to address the non-strategic settings prevalent in microeconomics, such
as supply-and-demand analysis. We address this gap by taxonomizing
microeconomic reasoning into $58$ distinct elements, focusing on the logic of
supply and demand, each grounded in up to $10$ distinct domains, $5$
perspectives, and $3$ types. The generation of benchmark data across this
combinatorial space is powered by a novel LLM-assisted data generation protocol
that we dub auto-STEER, which generates a set of questions by adapting
handwritten templates to target new domains and perspectives. Because it offers
an automated way of generating fresh questions, auto-STEER mitigates the risk
that LLMs will be trained to over-fit evaluation benchmarks; we thus hope that
it will serve as a useful tool both for evaluating and fine-tuning models for
years to come. We demonstrate the usefulness of our benchmark via a case study
on $27$ LLMs, ranging from small open-source models to the current state of the
art. We examined each model's ability to solve microeconomic problems across
our whole taxonomy and present the results across a range of prompting
strategies and scoring metrics.
comment: 18 pages, 11 figures
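The combinatorial generation over elements, domains, perspectives, and types can be illustrated with an invented miniature of the space (the names below are made up, and auto-STEER adapts handwritten templates with an LLM rather than by plain string substitution):

```python
from itertools import product

# Hypothetical miniature of the combinatorial question space
elements = ["supply curve shift", "demand elasticity"]
domains = ["coffee market", "labor market"]
perspectives = ["consumer", "producer"]
qtypes = ["quantitative", "qualitative"]

TEMPLATE = ("As a {perspective}, answer a {qtype} question about "
            "{element} in the {domain}.")

# one question per point in the cross-product of all factors
questions = [
    TEMPLATE.format(element=e, domain=d, perspective=p, qtype=t)
    for e, d, p, t in product(elements, domains, perspectives, qtypes)
]
```

Even this toy 2x2x2x2 grid yields 16 distinct questions; the benchmark's 58 elements, up to 10 domains, 5 perspectives, and 3 types multiply out to thousands of combinations.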
☆ The influence of motion features in temporal perception
This paper examines the role of manner-of-motion verbs in shaping subjective
temporal perception and emotional resonance. Through four complementary
studies, we explore how these verbs influence the conceptualization of time,
examining their use in literal and metaphorical (temporal) contexts. Our
findings reveal that faster verbs (e.g., fly, zoom) evoke dynamic and engaging
temporal experiences, often linked to positive emotions and greater agency. In
contrast, slower verbs (e.g., crawl, drag) convey passivity, monotony, and
negative emotions, reflecting tedious or constrained experiences of time. These
effects are amplified in metaphorical contexts, where manner verbs encode
emotional and experiential nuances that transcend their literal meanings. We
also find that participants prefer manner verbs over path verbs (e.g., go,
pass) in emotionally charged temporal contexts, as manner verbs capture the
experiential and emotional qualities of time more effectively. These findings
highlight the interplay between language, motion, and emotion in shaping
temporal perception, offering insights into how linguistic framing influences
subjective experiences of time.
☆ Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization
Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Amit Agarwal, Bhargava Kumar, Srikant Panda, Tejaswini Kumar
Clinical Question Answering (CQA) plays a crucial role in medical
decision-making, enabling physicians to extract relevant information from
Electronic Medical Records (EMRs). While transformer-based models such as BERT,
BioBERT, and ClinicalBERT have demonstrated state-of-the-art performance in
CQA, existing models lack the ability to categorize extracted answers, which is
critical for structured retrieval, content filtering, and medical decision
support.
To address this limitation, we introduce a Multi-Task Learning (MTL)
framework that jointly trains CQA models for both answer extraction and medical
categorization. In addition to predicting answer spans, our model classifies
responses into five standardized medical categories: Diagnosis, Medication,
Symptoms, Procedure, and Lab Reports. This categorization enables more
structured and interpretable outputs, making clinical QA models more useful in
real-world healthcare settings.
We evaluate our approach on emrQA, a large-scale dataset for medical question
answering. Results show that MTL improves F1-score by 2.2% compared to standard
fine-tuning, while achieving 90.7% accuracy in answer categorization. These
findings suggest that MTL not only enhances CQA performance but also introduces
an effective mechanism for categorization and structured medical information
retrieval.
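The joint objective described above — span extraction plus 5-way categorization — can be sketched as a single summed loss. The shapes and the stand-in encoder outputs below are assumptions for illustration, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

CATEGORIES = ["Diagnosis", "Medication", "Symptoms", "Procedure", "Lab Reports"]

def softmax_ce(logits, target):
    # cross-entropy of one example, computed stably
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def joint_loss(token_states, cls_state, W_span, W_cat,
               start_pos, end_pos, category):
    # Multi-task objective: span extraction (start + end pointers over
    # the sequence) plus 5-way category classification on the pooled
    # representation, summed into one training loss.
    span_logits = token_states @ W_span      # (seq_len, 2)
    cat_logits = cls_state @ W_cat           # (5,)
    return (softmax_ce(span_logits[:, 0], start_pos)
            + softmax_ce(span_logits[:, 1], end_pos)
            + softmax_ce(cat_logits, category))

# toy arrays standing in for BERT-style encoder outputs
hidden = rng.normal(size=(128, 64))          # per-token states
loss = joint_loss(hidden, hidden[0], rng.normal(size=(64, 2)),
                  rng.normal(size=(64, 5)), start_pos=10, end_pos=14,
                  category=CATEGORIES.index("Medication"))
```

Because the two heads share the encoder, gradients from the categorization task shape the same representations used for span extraction, which is the usual mechanism behind MTL gains of the kind reported here.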
☆ Understanding and Rectifying Safety Perception Distortion in VLMs
Recent studies reveal that vision-language models (VLMs) become more
susceptible to harmful requests and jailbreak attacks after integrating the
vision modality, exhibiting greater vulnerability than their text-only LLM
backbones. To uncover the root cause of this phenomenon, we conduct an in-depth
analysis and identify a key issue: multimodal inputs introduce a
modality-induced activation shift toward a "safer" direction compared to their
text-only counterparts, leading VLMs to systematically overestimate the safety
of harmful inputs. We refer to this issue as safety perception distortion. To
mitigate such distortion, we propose Activation Shift Disentanglement and
Calibration (ShiftDC), a training-free method that decomposes and calibrates
the modality-induced activation shift to reduce the impact of modality on
safety. By isolating and removing the safety-relevant component, ShiftDC
restores the inherent safety alignment of the LLM backbone while preserving the
vision-language capabilities of VLMs. Empirical results demonstrate that
ShiftDC significantly enhances alignment performance on safety benchmarks
without impairing model utility.
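One way to picture the decompose-and-calibrate idea is the sketch below. This is not the paper's exact procedure — the safety direction, the mean-difference shift estimate, and the projection step are all simplifying assumptions — but it shows the training-free pattern: estimate the modality-induced shift and subtract only its safety-relevant component.

```python
import numpy as np

rng = np.random.default_rng(0)

def calibrate(acts_mm, acts_text, safety_direction):
    # Estimate the modality-induced shift as the mean difference between
    # multimodal and text-only activations, then remove the component of
    # that shift lying along the safety direction. Only that component
    # is subtracted, so other modality information is preserved.
    shift = acts_mm.mean(axis=0) - acts_text.mean(axis=0)
    d = safety_direction / np.linalg.norm(safety_direction)
    return acts_mm - (shift @ d) * d

acts_text = rng.normal(size=(64, 128))             # text-only activations
acts_mm = acts_text + 0.5 + 0.1 * rng.normal(size=(64, 128))
direction = rng.normal(size=128)                   # assumed safety axis
calibrated = calibrate(acts_mm, acts_text, direction)
```

After calibration, the residual shift between the two modalities is orthogonal to the assumed safety axis, so the model's safety-relevant activations match its text-only behavior while visual information along other directions is untouched.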
☆ Text2World: Benchmarking Large Language Models for Symbolic World Model Generation
Mengkang Hu, Tianxing Chen, Yude Zou, Yuheng Lei, Qiguang Chen, Ming Li, Hongyuan Zhang, Wenqi Shao, Ping Luo
Recently, there has been growing interest in leveraging large language models
(LLMs) to generate symbolic world models from textual descriptions. Although
LLMs have been extensively explored in the context of world modeling, prior
studies encountered several challenges, including evaluation randomness,
dependence on indirect metrics, and a limited domain scope. To address these
limitations, we introduce a novel benchmark, Text2World, based on planning
domain definition language (PDDL), featuring hundreds of diverse domains and
employing multi-criteria, execution-based metrics for a more robust evaluation.
We benchmark current LLMs using Text2World and find that reasoning models
trained with large-scale reinforcement learning outperform others. However,
even the best-performing model still demonstrates limited capabilities in world
modeling. Building on these insights, we examine several promising strategies
to enhance the world modeling capabilities of LLMs, including test-time
scaling, agent training, and more. We hope that Text2World can serve as a
crucial resource, laying the groundwork for future research in leveraging LLMs
as world models. The project page is available at
https://text-to-world.github.io/.
comment: Project page: https://text-to-world.github.io/
☆ KAPPA: A Generic Patent Analysis Framework with Keyphrase-Based Portraits
Patent analysis highly relies on concise and interpretable document
representations, referred to as patent portraits. Keyphrases, both present and
absent, are ideal candidates for patent portraits due to their brevity,
representativeness, and clarity. In this paper, we introduce KAPPA, an
integrated framework designed to construct keyphrase-based patent portraits and
enhance patent analysis. KAPPA operates in two phases: patent portrait
construction and portrait-based analysis. To ensure effective portrait
construction, we propose a semantic-calibrated keyphrase generation paradigm
that integrates pre-trained language models with a prompt-based hierarchical
decoding strategy to leverage the multi-level structural characteristics of
patents. For portrait-based analysis, we develop a comprehensive framework that
employs keyphrase-based patent portraits to enable efficient and accurate
patent analysis. In extensive experiments on benchmark datasets for keyphrase
generation, the proposed model achieves significant improvements over
state-of-the-art baselines. Further experiments conducted on real-world patent
applications demonstrate that our keyphrase-based portraits effectively capture
domain-specific knowledge and enrich semantic representation for patent
analysis tasks.
☆ Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity
A range of recent works addresses the problem of compressing a sequence of
tokens into a shorter sequence of real-valued vectors to be used as inputs
instead of token embeddings or a key-value cache. These approaches make it
possible to reduce the amount of compute in existing language models. Despite relying on
powerful models as encoders, the maximum attainable lossless compression ratio
is typically not higher than x10. This fact is highly intriguing because, in
theory, the maximum information capacity of large real-valued vectors is far
beyond the presented rates even for 16-bit precision and a modest vector size.
In this work, we explore the limits of compression by replacing the encoder
with a per-sample optimization procedure. We show that vectors with compression
ratios up to x1500 exist, which highlights a two-orders-of-magnitude gap between
existing and practically attainable solutions. Furthermore, we empirically show
that the compression limits are determined not by the length of the input but
by the amount of uncertainty to be reduced, namely, the cross-entropy loss on
this sequence without any conditioning. The obtained limits highlight the
substantial gap between the theoretical capacity of input embeddings and their
practical utilization, suggesting significant room for optimization in model
design.
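The per-sample optimization idea — replace the encoder with direct gradient descent on the input vector, keeping the model frozen — can be demonstrated on a toy scale. The "decoder" below is a hypothetical stand-in (a fixed random readout per position), not an LLM, but the mechanism is the same: only the compressed vector z is trained, and the objective is exactly the cross-entropy on the token sequence that the abstract identifies as the limiting quantity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy frozen "decoder": token t is predicted from the compressed vector z
# through a fixed random readout matrix W[t]. The model never changes.
T, V, D = 20, 50, 32                 # sequence length, vocab, vector size
W = rng.normal(size=(T, V, D)) / np.sqrt(D)
tokens = rng.integers(0, V, size=T)

def loss_and_grad(z):
    # summed cross-entropy of the frozen decoder over the sequence,
    # with its analytic gradient w.r.t. z (the loss is convex in z)
    total, grad = 0.0, np.zeros(D)
    for t in range(T):
        logits = W[t] @ z
        logits = logits - logits.max()
        p = np.exp(logits)
        p /= p.sum()
        total += -np.log(p[tokens[t]])
        p[tokens[t]] -= 1.0          # dCE/dlogits = softmax - one_hot
        grad += W[t].T @ p
    return total, grad

# per-sample optimization: gradient descent on z alone
z = np.zeros(D)
losses = []
for _ in range(500):
    loss, grad = loss_and_grad(z)
    losses.append(loss)
    z -= 0.02 * grad
```

At initialization the loss equals T·ln(V) (uniform predictions), and optimization drives it down; how far it can go depends on the uncertainty in the sequence rather than its raw length, mirroring the paper's finding.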
☆ Improved Fine-Tuning of Large Multimodal Models for Hateful Meme Detection
Hateful memes have become a significant concern on the Internet,
necessitating robust automated detection systems. While large multimodal models
have shown strong generalization across various tasks, they exhibit poor
generalization to hateful meme detection due to the dynamic nature of memes
tied to emerging social trends and breaking news. Recent work further
highlights the limitations of conventional supervised fine-tuning for large
multimodal models in this context. To address these challenges, we propose
Large Multimodal Model Retrieval-Guided Contrastive Learning (LMM-RGCL), a
novel two-stage fine-tuning framework designed to improve both in-domain
accuracy and cross-domain generalization. Experimental results on six widely
used meme classification datasets demonstrate that LMM-RGCL achieves
state-of-the-art performance, outperforming agent-based systems such as
VPD-PALI-X-55B. Furthermore, our method effectively generalizes to
out-of-domain memes under low-resource settings, surpassing models like GPT-4o.
comment: Preprint. Under Review
☆ SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models
Xianfu Cheng, Wei Zhang, Shiwei Zhang, Jian Yang, Xiangyuan Guan, Xianjie Wu, Xiang Li, Ge Zhang, Jiaheng Liu, Yuying Mai, Yutao Zeng, Zhoufutu Wen, Ke Jin, Baorui Wang, Weixiao Zhou, Yunhong Lu, Tongliang Li, Wenhao Huang, Zhoujun Li
The increasing application of multi-modal large language models (MLLMs)
across various sectors has spotlighted the importance of their output
reliability and accuracy, particularly their ability to produce content
grounded in factual information (e.g., common and domain-specific knowledge). In this work, we
introduce SimpleVQA, the first comprehensive multi-modal benchmark to evaluate
the factuality ability of MLLMs to answer natural language short questions.
SimpleVQA is characterized by six key features: it covers multiple tasks and
multiple scenarios, ensures high-quality and challenging queries, maintains
static and timeless reference answers, and is straightforward to evaluate. Our
approach involves categorizing visual question-answering items into 9 different
tasks around objective events or common knowledge and situating these within 9
topics. Rigorous quality control processes are implemented to guarantee
high-quality, concise, and clear answers, facilitating evaluation with minimal
variance via an LLM-as-a-judge scoring system. Using SimpleVQA, we perform a
comprehensive assessment of 18 leading MLLMs and 8 text-only LLMs, delving into
their image comprehension and text generation abilities by identifying and
analyzing error cases.
☆ AEIA-MN: Evaluating the Robustness of Multimodal LLM-Powered Mobile Agents Against Active Environmental Injection Attacks
As researchers continuously optimize AI agents to perform tasks more
effectively within operating systems, they often neglect to address the
critical need for enabling these agents to identify "impostors" within the
system. Through an analysis of the agents' operating environment, we identified
a potential threat: attackers can disguise their attack methods as
environmental elements, injecting active disturbances into the agents'
execution process, thereby disrupting their decision-making. We define this
type of attack as Active Environment Injection Attack (AEIA). Based on this, we
propose AEIA-MN, an active environment injection attack scheme that exploits
interaction vulnerabilities in the mobile operating system to evaluate the
robustness of MLLM-based agents against such threats. Experimental results show
that even advanced MLLMs are highly vulnerable to this attack, which achieves a
maximum success rate of 93% in the AndroidWorld benchmark.
☆ Do we still need Human Annotators? Prompting Large Language Models for Aspect Sentiment Quad Prediction
Aspect sentiment quadruple prediction (ASQP) facilitates a detailed
understanding of opinions expressed in a text by identifying the opinion term,
aspect term, aspect category and sentiment polarity for each opinion. However,
annotating a full set of training examples to fine-tune models for ASQP is a
resource-intensive process. In this study, we explore the capabilities of large
language models (LLMs) for zero- and few-shot learning on the ASQP task across
five diverse datasets. We report F1 scores slightly below those obtained with
state-of-the-art fine-tuned models but exceeding previously reported zero- and
few-shot performance. In the 40-shot setting on the Rest16 restaurant domain
dataset, LLMs achieved an F1 score of 52.46, compared to 60.39 by the
best-performing fine-tuned method MVP. Additionally, we report the performance
of LLMs in target aspect sentiment detection (TASD), where the F1 scores were
also close to fine-tuned models, achieving 66.03 on Rest16 in the 40-shot
setting, compared to 72.76 with MVP. While human annotators remain essential
for achieving optimal performance, LLMs can reduce the need for extensive
manual annotation in ASQP tasks.
☆ Natural Language Generation from Visual Sequences: Challenges and Future Directions
The ability to use natural language to talk about visual content is at the
core of human intelligence and a crucial feature of any artificial intelligence
system. Various studies have focused on generating text for single images. In
contrast, comparatively little attention has been paid to exhaustively
analyzing and advancing work on multiple-image vision-to-text settings. In this
position paper, we claim that any task dealing with temporally ordered
sequences of multiple images or frames is an instance of a broader, more
general problem involving the understanding of intricate relationships between
the visual content and the corresponding text. We comprehensively analyze five
tasks that are instances of this problem and argue that they pose a common set
of challenges and share similarities in terms of modeling and evaluation
approaches. Based on the insights from these various aspects and stages of
multi-image-to-text generation, we highlight several open questions and suggest
future research directions. We believe that these directions can advance the
understanding of complex phenomena in this domain and the development of better
models.
☆ HPSS: Heuristic Prompting Strategy Search for LLM Evaluators
Bosi Wen, Pei Ke, Yufei Sun, Cunxiang Wang, Xiaotao Gu, Jinfeng Zhou, Jie Tang, Hongning Wang, Minlie Huang
Since the adoption of large language models (LLMs) for text evaluation has
become increasingly prevalent in the field of natural language processing
(NLP), a series of existing works attempt to optimize the prompts for LLM
evaluators to improve their alignment with human judgment. However, their
efforts are limited to optimizing individual factors of evaluation prompts,
such as evaluation criteria or output formats, neglecting the combinatorial
impact of multiple factors, which leads to insufficient optimization of the
evaluation pipeline. Nevertheless, identifying well-behaved prompting
strategies for adjusting multiple factors requires extensive enumeration. To
this end, we comprehensively integrate 8 key factors for evaluation prompts and
propose a novel automatic prompting strategy optimization method called
Heuristic Prompting Strategy Search (HPSS). Inspired by the genetic algorithm,
HPSS conducts an iterative search to find well-behaved prompting strategies for
LLM evaluators. A heuristic function is employed to guide the search process,
enhancing the performance of our algorithm. Extensive experiments across four
evaluation tasks demonstrate the effectiveness of HPSS, consistently
outperforming both human-designed evaluation prompts and existing automatic
prompt optimization methods.
comment: 32 pages, 10 figures
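A genetic-algorithm-inspired search of the kind HPSS describes can be sketched as follows. Everything concrete here is an assumption for illustration: the factor names and options are invented, and the fitness function is a toy stand-in for measuring an LLM evaluator's agreement with human judgment (which HPSS guides with a learned heuristic rather than exhaustive evaluation).

```python
import random

random.seed(0)

# Hypothetical strategy space: 8 prompt factors, each with a few
# categorical options (in HPSS these are evaluation criteria, output
# format, and so on).
FACTORS = {f"factor_{i}": ["a", "b", "c"] for i in range(8)}

def fitness(strategy):
    # stand-in score; the real search would run the LLM evaluator with
    # this strategy and measure correlation with human judgments
    return sum(1 for v in strategy.values() if v == "a")

def search(pop_size=20, generations=30, mut_rate=0.2):
    # genetic-algorithm-style iterative search: keep the fittest half,
    # recombine random parent pairs, and mutate individual factors
    pop = [{k: random.choice(opts) for k, opts in FACTORS.items()}
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            p1, p2 = random.sample(parents, 2)
            child = {k: random.choice((p1[k], p2[k])) for k in FACTORS}
            for k in child:
                if random.random() < mut_rate:   # mutation
                    child[k] = random.choice(FACTORS[k])
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = search()
```

Because each fitness evaluation is expensive when it involves an actual LLM evaluator, the value of a heuristic guiding function is precisely to avoid enumerating the full 3^8 strategy space.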
☆ Whose story is it? Personalizing story generation by inferring author styles
Personalization has become essential for improving user experience in
interactive writing and educational applications, yet its potential in story
generation remains largely unexplored. In this work, we propose a novel
two-stage pipeline for personalized story generation. Our approach first infers
an author's implicit story-writing characteristics from their past work and
organizes them into an Author Writing Sheet, inspired by narrative theory. The
second stage uses this sheet to simulate the author's persona through tailored
persona descriptions and personalized story writing rules. To enable and
validate our approach, we construct Mythos, a dataset of 590 stories from 64
authors across five distinct sources that reflect diverse story-writing
settings. A head-to-head comparison with a non-personalized baseline
demonstrates our pipeline's effectiveness in generating high-quality
personalized stories. Our personalized stories achieve a 75 percent win rate
(versus 14 percent for the baseline and 11 percent ties) in capturing authors'
writing style based on their past works. Human evaluation highlights the high
quality of our Author Writing Sheet and provides valuable insights into the
personalized story generation task. Notable takeaways are that writings from
certain sources, such as Reddit, are easier to personalize than others, like
AO3, while narrative aspects, like Creativity and Language Use, are easier to
personalize than others, like Plot.
comment: preprint 52 pages
☆ Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks
We present an agentic, autonomous graph expansion framework that iteratively
structures and refines knowledge in situ. Unlike conventional knowledge graph
construction methods relying on static extraction or single-pass learning, our
approach couples a reasoning-native large language model with a continually
updated graph representation. At each step, the system actively generates new
concepts and relationships, merges them into a global graph, and formulates
subsequent prompts based on its evolving structure. Through this
feedback-driven loop, the model organizes information into a scale-free network
characterized by hub formation, stable modularity, and bridging nodes that link
disparate knowledge clusters. Over hundreds of iterations, new nodes and edges
continue to appear without saturating, while centrality measures and shortest
path distributions evolve to yield increasingly distributed connectivity. Our
analysis reveals emergent patterns, such as the rise of highly connected 'hub'
concepts and the shifting influence of 'bridge' nodes, indicating that agentic,
self-reinforcing graph construction can yield open-ended, coherent knowledge
structures. Applied to materials design problems, we present compositional
reasoning experiments by extracting node-specific and synergy-level principles
to foster genuinely novel knowledge synthesis, yielding cross-domain ideas that
transcend rote summarization and strengthen the framework's potential for
open-ended scientific discovery. We discuss other applications in scientific
discovery and outline future directions for enhancing scalability and
interpretability.
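The feedback-driven expansion loop can be sketched with a dummy proposal function standing in for the reasoning LLM (the seed concept, the degree-weighted focus rule, and `propose` are all illustrative assumptions, not the paper's prompting scheme):

```python
import random
from itertools import count

random.seed(1)

graph = {"materials design": set()}   # node -> set of neighbor nodes
_ids = count()

def propose(focus_node):
    # stand-in for the LLM call: invent one new concept linked to focus
    return [(focus_node, f"concept_{next(_ids)}")]

def pick_focus():
    # Formulate the next prompt from the evolving structure: sample a
    # node with probability proportional to its degree, so hubs attract
    # further expansion (the mechanism behind scale-free structure).
    nodes = list(graph)
    weights = [len(graph[n]) + 1 for n in nodes]
    return random.choices(nodes, weights=weights)[0]

def expand(steps=200):
    for _ in range(steps):
        # generate new concepts/relations and merge into the global graph
        for a, b in propose(pick_focus()):
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)

expand()
```

Each iteration adds one node and one edge; because high-degree nodes are more likely to be chosen as the next focus, a few hub concepts accumulate many connections while most nodes stay sparse, qualitatively matching the hub formation the paper reports.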
☆ Oreo: A Plug-in Context Reconstructor to Enhance Retrieval-Augmented Generation
Despite the remarkable capabilities of Large Language Models (LLMs) in
various NLP tasks, they remain vulnerable to hallucinations due to their
limited parametric knowledge and lack of domain-specific expertise.
Retrieval-Augmented Generation (RAG) addresses this challenge by incorporating
external document retrieval to augment the knowledge base of LLMs. In this
approach, RAG retrieves document chunks from an external corpus in response to
a query, which are then used as context for the downstream language model to
generate an answer. However, these retrieved knowledge sources often include
irrelevant or erroneous information, undermining the effectiveness of RAG in
downstream tasks. To overcome this limitation, we introduce a compact,
efficient, and pluggable module designed to refine external knowledge sources
before feeding them to the generator. The module reconstructs retrieved content
by extracting the most relevant and supportive information and reorganising it
into a concise, query-specific format. Through a three-stage training paradigm
- comprising supervised fine-tuning, contrastive multi-task learning, and
reinforcement learning-based alignment - it prioritises critical knowledge and
aligns it with the generator's preferences. This method enables LLMs to produce
outputs that are more accurate, reliable, and contextually appropriate.
comment: 14 pages
☆ Towards a Design Guideline for RPA Evaluation: A Survey of Large Language Model-Based Role-Playing Agents
The Role-Playing Agent (RPA) is an increasingly popular type of LLM agent that
simulates human-like behavior in a variety of tasks. However, evaluating RPAs
is challenging due to diverse task requirements and agent designs. This paper
proposes an evidence-based, actionable, and generalizable evaluation design
guideline for LLM-based RPAs by systematically reviewing 1,676 papers published
between Jan. 2021 and Dec. 2024. Our analysis identifies six agent attributes,
seven task attributes, and seven evaluation metrics from existing literature.
Based on these findings, we present an RPA evaluation design guideline to help
researchers develop more systematic and consistent evaluation methods.
☆ Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge
Large Language Models (LLMs) have significantly advanced medical
question-answering by leveraging extensive clinical data and medical
literature. However, the rapid evolution of medical knowledge and the
labor-intensive process of manually updating domain-specific resources pose
challenges to the reliability of these systems. To address this, we introduce
Adaptive Medical Graph-RAG (AMG-RAG), a comprehensive framework that automates
the construction and continuous updating of medical knowledge graphs,
integrates reasoning, and retrieves current evidence from external sources such
as PubMed and WikiSearch. By dynamically linking new findings and complex medical
concepts, AMG-RAG not only improves accuracy but also enhances interpretability
in medical queries.
Evaluations on the MEDQA and MEDMCQA benchmarks demonstrate the effectiveness
of AMG-RAG, achieving an F1 score of 74.1 percent on MEDQA and an accuracy of
66.34 percent on MEDMCQA, outperforming both comparable models and those 10 to
100 times larger. Notably, these improvements are achieved without increasing
computational overhead, highlighting the critical role of automated knowledge
graph generation and external evidence retrieval in delivering up-to-date,
trustworthy medical insights.
☆ Language Barriers: Evaluating Cross-Lingual Performance of CNN and Transformer Architectures for Speech Quality Estimation
Objective speech quality models aim to predict human-perceived speech quality
using automated methods. However, cross-lingual generalization remains a major
challenge, as Mean Opinion Scores (MOS) vary across languages due to
linguistic, perceptual, and dataset-specific differences. A model trained
primarily on English data may struggle to generalize to languages with
different phonetic, tonal, and prosodic characteristics, leading to
inconsistencies in objective assessments. This study investigates the
cross-lingual performance of two speech quality models: NISQA, a CNN-based
model, and a Transformer-based Audio Spectrogram Transformer (AST) model. Both
models were trained exclusively on English datasets containing over 49,000
speech samples and subsequently evaluated on speech in German, French,
Mandarin, Swedish, and Dutch. We analyze model performance using Pearson
Correlation Coefficient (PCC) and Root Mean Square Error (RMSE) across five
speech quality dimensions: coloration, discontinuity, loudness, noise, and MOS.
Our findings show that while AST achieves a more stable cross-lingual
performance, both models exhibit noticeable biases. Notably, Mandarin speech
quality predictions correlate highly with human MOS scores, whereas Swedish and
Dutch present greater prediction challenges. Discontinuities remain difficult
to model across all languages. These results highlight the need for more
balanced multilingual datasets and architecture-specific adaptations to improve
cross-lingual generalization.
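The two evaluation metrics used in this study, Pearson Correlation Coefficient (PCC) and Root Mean Square Error (RMSE), are standard and easy to reproduce; a minimal self-contained sketch (function names are ours, not from the paper):

```python
import math

def pcc(pred, target):
    """Pearson correlation between predicted and human MOS scores."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(target) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, target))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in target))
    return cov / (sp * st)

def rmse(pred, target):
    """Root mean square error between predictions and targets."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))
```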
☆ You need to MIMIC to get FAME: Solving Meeting Transcript Scarcity with Multi-Agent Conversations
Meeting summarization suffers from limited high-quality data, mainly due to
privacy restrictions and expensive collection processes. We address this gap
with FAME, a dataset of 500 meetings in English and 300 in German produced by
MIMIC, our new multi-agent meeting synthesis framework that generates meeting
transcripts on a given knowledge source by defining psychologically grounded
participant profiles, outlining the conversation, and orchestrating a large
language model (LLM) debate. A modular post-processing step refines these
outputs, mitigating potential repetitiveness and overly formal tones, ensuring
coherent, credible dialogues at scale. We also propose a psychologically
grounded evaluation framework assessing naturalness, social behavior
authenticity, and transcript difficulties. Human assessments show that FAME
approximates real-meeting spontaneity (4.5/5 in naturalness), preserves
speaker-centric challenges (3/5 in spoken language), and introduces richer
information-oriented difficulty (4/5 in difficulty). These findings highlight
that FAME is a faithful and scalable proxy for real-world meeting conditions. It
enables new test scenarios for meeting summarization research and other
conversation-centric applications in tasks requiring conversation data or
simulating social scenarios under behavioral constraints.
☆ Eager Updates For Overlapped Communication and Computation in DiLoCo
Distributed optimization methods such as DiLoCo have been shown to be
effective in training very large models across multiple distributed workers,
such as datacenters. These methods split updates into two parts: an inner
optimization phase, where the workers independently execute multiple
optimization steps on their own local data, and an outer optimization step,
where the inner updates are synchronized. While such approaches require orders
of magnitude less communication than standard data-parallel training, in
settings where the workers are datacenters, even the limited communication
requirements of these approaches can still cause significant slowdowns due to
the blocking necessary at each outer optimization step. In this paper, we
investigate techniques to mitigate this issue by overlapping communication with
computation in a manner that allows the outer optimization step to fully
overlap with the inner optimization phase. We show that a particular variant,
dubbed eager updates, provides competitive performance with standard DiLoCo in
settings with low bandwidth between workers.
comment: arXiv admin note: text overlap with arXiv:2501.18512
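The eager-update idea can be illustrated with a toy simulation, assuming a scalar model and a faked quadratic loss (the setup and all names are ours, not from the paper): each round applies the previous round's averaged outer delta, so the synchronization that produces it could overlap with the current inner phase instead of blocking.

```python
import random

def inner_phase(weights, data, steps=4, lr=0.1):
    """One worker's inner optimization: several local SGD steps on its
    own shard; returns the local update (delta) for the outer step."""
    w = weights
    for _ in range(steps):
        x = random.choice(data)
        grad = 2 * (w - x)          # gradient of the toy loss (w - x)^2
        w -= lr * grad
    return w - weights

def train(num_workers=4, rounds=5):
    """Outer loop with eager updates: the averaged delta from round t is
    applied at the start of round t+1, so its communication can overlap
    with round t+1's inner computation rather than blocking."""
    w = 0.0
    shards = [[random.gauss(3.0, 1.0) for _ in range(16)]
              for _ in range(num_workers)]
    pending = 0.0                   # outer delta still "in flight"
    for _ in range(rounds):
        w += pending                # eager: apply last round's sync now
        deltas = [inner_phase(w, shard) for shard in shards]
        pending = sum(deltas) / num_workers
    return w + pending
```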
☆ B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability
Post-hoc explanation methods for black-box models often struggle with
faithfulness and human interpretability due to the lack of explainability in
current neural models. Meanwhile, B-cos networks have been introduced to
improve model explainability through architectural and computational
adaptations, but their application has so far been limited to computer vision
models and their associated training pipelines. In this work, we introduce
B-cos LMs, i.e., B-cos networks empowered for NLP tasks. Our approach directly
transforms pre-trained language models into B-cos LMs by combining B-cos
conversion and task fine-tuning, improving efficiency compared to previous
B-cos methods. Our automatic and human evaluation results demonstrate that
B-cos LMs produce more faithful and human-interpretable explanations than
post-hoc methods, while maintaining task performance comparable to conventional
fine-tuning. Our in-depth analysis explores how B-cos LMs differ from
conventionally fine-tuned models in their learning processes and explanation
patterns. Finally, we provide practical guidelines for effectively building
B-cos LMs based on our findings. Our code is available at
https://anonymous.4open.science/r/bcos_lm.
comment: 20 pages, 15 figures
☆ Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs
Previous approaches to persona simulation with large language models (LLMs)
have typically relied on learning basic biographical information or using
limited role-play dialogue datasets to capture a character's responses. However,
a holistic representation of an individual goes beyond surface-level facts and
conversations to deeper thoughts and reasoning. In this work, we introduce
CharacterBot, a model designed to replicate both the linguistic patterns and
distinctive thought processes of a character. Using Lu Xun, a renowned Chinese
writer, as a case study, we propose four training tasks derived from his 17
essay collections. These include a pre-training task focused on mastering
external linguistic structures and knowledge, as well as three fine-tuning
tasks: multiple-choice question answering, generative question answering, and
style transfer, each aligning the LLM with Lu Xun's internal ideation and
writing style. To optimize learning across these tasks, we introduce a CharLoRA
parameter updating mechanism, where a general linguistic style expert
collaborates with other task-specific experts to better study both the language
style and the understanding of deeper thoughts. We evaluate CharacterBot on
three tasks for linguistic accuracy and opinion comprehension, demonstrating
that it significantly outperforms the baselines on our adapted metrics. We hope
that this work inspires future research on deep character persona simulation
with LLMs.
comment: 19 pages, 3 figures
☆ Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs
Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu, Yongchi Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung, Kunat Pipatanakul, Fajri Koto, Min Si Thu, Hynek Kydlíček, Zeyi Liu, Qunshu Lin, Sittipong Sripaisarnmongkol, Kridtaphad Sae-Khow, Nirattisai Thongchim, Taechawat Konkaew, Narong Borijindargoon, Anh Dao, Matichon Maneegard, Phakphum Artkaew, Zheng-Xin Yong, Quan Nguyen, Wannaphong Phatthiyaphaibun, Hoang H. Tran, Mike Zhang, Shiqi Chen, Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, Min Lin
Sailor2 is a family of cutting-edge multilingual language models for
South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes to suit
diverse applications. Building on Qwen2.5, Sailor2 undergoes continuous
pre-training on 500B tokens (400B SEA-specific and 100B replay tokens) to
support 13 SEA languages while retaining proficiency in Chinese and English.
The Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA
languages. We also deliver a comprehensive cookbook on how to develop
multilingual models in an efficient manner, covering five key aspects: data
curation, pre-training, post-training, model customization, and evaluation. We
hope that the Sailor2 model (Apache 2.0 license) will drive language development
in the SEA region, and that the Sailor2 cookbook will inspire researchers to build more
inclusive LLMs for other under-served languages.
comment: 49 pages, 16 figures. Technical Report of Sailor2:
https://sea-sailor.github.io/blog/sailor2/
☆ Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking
The reasoning abilities of Large Language Models (LLMs) have demonstrated
remarkable advancement and exceptional performance across diverse domains.
However, leveraging these reasoning capabilities to enhance LLM safety against
adversarial attacks and jailbreak queries remains largely unexplored. To bridge
this gap, we propose Reasoning-to-Defend (R2D), a novel training paradigm that
integrates safety reflections of queries and responses into LLMs' generation
process, unlocking a safety-aware reasoning mechanism. This approach enables
self-evaluation at each reasoning step to create safety pivot tokens as
indicators of the response's safety status. Furthermore, to improve the
learning efficiency of pivot token prediction, we propose Contrastive Pivot
Optimization (CPO), which enhances the model's ability to perceive the safety
status of dialogues. Through this mechanism, LLMs dynamically adjust their
response strategies during reasoning, significantly enhancing their defense
capabilities against jailbreak attacks. Extensive experimental results
demonstrate that R2D effectively mitigates various attacks and improves overall
safety, highlighting the substantial potential of safety-aware reasoning in
strengthening LLMs' robustness against jailbreaks.
comment: 18 pages
☆ A Survey of Text Classification Under Class Distribution Shift
Adriana Valentina Costache, Silviu Florin Gheorghe, Eduard Gabriel Poesina, Paul Irofti, Radu Tudor Ionescu
The basic underlying assumption of machine learning (ML) models is that the
training and test data are sampled from the same distribution. However, in
daily practice, this assumption is often broken, i.e., the distribution of the
test data changes over time, which hinders the application of conventional ML
models. One domain where the distribution shift naturally occurs is text
classification, since people always find new topics to discuss. To this end, we
survey research articles studying open-set text classification and related
tasks. We divide the methods in this area based on the constraints that define
the kind of distribution shift and the corresponding problem formulation,
i.e., learning with the Universum, zero-shot learning, and open-set learning. We
next discuss the predominant mitigation approaches for each problem setup.
Finally, we identify several future work directions, aiming to push the
boundaries beyond the state of the art. Interestingly, we find that continual
learning can solve many of the issues caused by the shifting class
distribution. We maintain a list of relevant papers at
https://github.com/Eduard6421/Open-Set-Survey.
☆ Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs
Large Language Models (LLMs) often generate outputs that lack grounding in
real-world facts, a phenomenon known as hallucinations. Prior research has
associated hallucinations with model uncertainty, leveraging this relationship
for hallucination detection and mitigation. In this paper, we challenge the
underlying assumption that all hallucinations are associated with uncertainty.
Using knowledge detection and uncertainty measurement methods, we demonstrate
that models can hallucinate with high certainty even when they have the correct
knowledge. We further show that high-certainty hallucinations are consistent
across models and datasets, distinctive enough to be singled out, and challenge
existing mitigation methods. Our findings reveal an overlooked aspect of
hallucinations, emphasizing the need to understand their origins and improve
mitigation strategies to enhance LLM safety. The code is available at
https://github.com/technion-cs-nlp/Trust_me_Im_wrong .
☆ Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing
Limited by the context window size of Large Language Models (LLMs), handling
various tasks with input tokens exceeding the upper limit has been challenging,
whether it is a simple direct retrieval task or a complex multi-hop reasoning
task. Although various methods have been proposed to enhance the long-context
processing capabilities of LLMs, they either incur substantial post-training
costs, require additional tool modules (e.g., RAG), or have not shown
significant improvement in realistic tasks. Our work observes the correlation
between the attention distribution and generated answers across each layer, and
establishes through experiments that attention allocation aligns with
retrieval-augmented capabilities. Drawing on these insights, we propose a novel
method, InfiniRetri, that leverages the LLM's own attention information to
enable accurate retrieval over inputs of unbounded length. Our evaluations
indicate that InfiniRetri achieves 100% accuracy in the
Needle-In-a-Haystack (NIH) test over 1M tokens using a 0.5B parameter model,
surpassing other methods and larger models and setting a new
state-of-the-art (SOTA). Moreover, our method achieves significant performance
improvements on real-world benchmarks, with a maximum 288% improvement. In
addition, InfiniRetri can be applied to any Transformer-based LLM without
additional training and substantially reduces inference latency and compute
overhead in long texts. In summary, our comprehensive studies show
InfiniRetri's potential for practical applications and establish a paradigm for
retrieving information using LLMs' own capabilities over inputs of unbounded
length. Code will be released.
comment: 21 pages
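As a rough illustration only (the paper's actual scoring is derived from the LLM's internal attention maps, which are not reproduced here), attention-guided chunk retrieval might be sketched as scoring each chunk by the attention mass its tokens receive from the query and keeping the top-scoring chunks in document order:

```python
def score_chunks(attention, chunk_size):
    """attention: per-token attention weights from the query tokens
    (a flat list). Returns the summed attention mass per chunk."""
    return [sum(attention[i:i + chunk_size])
            for i in range(0, len(attention), chunk_size)]

def retrieve(chunks, attention, chunk_size, top_k=2):
    """Keep the top_k chunks by attention mass, preserving document order."""
    scores = score_chunks(attention, chunk_size)
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:top_k])
    return [chunks[i] for i in keep]
```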
☆ Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger
Wenjun Li, Dexun Li, Kuicai Dong, Cong Zhang, Hao Zhang, Weiwen Liu, Yasheng Wang, Ruiming Tang, Yong Liu
Large language models (LLMs) have shown remarkable emergent capabilities,
transforming the execution of functional tasks by leveraging external tools for
complex problems that require specialized processing or real-time data. While
existing research expands LLMs' access to diverse tools (e.g., program
interpreters, search engines, weather/map apps), the necessity of using these
tools is often overlooked, leading to indiscriminate tool invocation. This
naive approach raises two key issues: (1) increased delays due to unnecessary
tool calls, and (2) potential errors resulting from faulty interactions with
external tools. In this paper, we introduce meta-cognition as a proxy for an
LLM's self-assessment of its capabilities, representing the model's awareness of
its own limitations. Based on this, we propose MeCo, an adaptive
decision-making strategy for external tool use. MeCo quantifies metacognitive
scores by capturing high-level cognitive signals in the representation space,
guiding when to invoke tools. Notably, MeCo is fine-tuning-free and incurs
minimal cost. Our experiments show that MeCo accurately detects LLMs' internal
cognitive signals and significantly improves tool-use decision-making across
multiple base models and benchmarks.
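A hedged sketch of how a metacognition-based gate could work, with a stand-in linear probe in place of the paper's representation-space signals (all names, the probe, and the threshold are illustrative assumptions):

```python
import math

def meta_cognitive_score(hidden_state, probe_weights, bias=0.0):
    """Logistic probe over a hidden-state vector; a high score is read
    as the model believing it can answer from its own knowledge."""
    z = sum(h * w for h, w in zip(hidden_state, probe_weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def answer(query, hidden_state, probe_weights, threshold=0.5):
    """Invoke an external tool only when the metacognitive score is low."""
    if meta_cognitive_score(hidden_state, probe_weights) >= threshold:
        return ("direct", f"answer '{query}' from parametric knowledge")
    return ("tool", f"invoke external tool for '{query}'")
```

Because the gate is a lightweight probe over existing hidden states, it adds essentially no cost and needs no fine-tuning of the base model, matching the paper's stated design goal.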
☆ AlignFreeze: Navigating the Impact of Realignment on the Layers of Multilingual Models Across Diverse Languages NAACL 2025
Realignment techniques are often employed to enhance cross-lingual transfer
in multilingual language models; still, they can sometimes degrade performance
in languages that differ significantly from the fine-tuned source language.
This paper introduces AlignFreeze, a method that freezes either the lower or
the upper half of a model's layers during realignment. Through controlled experiments on
4 tasks, 3 models, and in 35 languages, we find that realignment affects all
the layers but can be the most detrimental to the lower ones. Freezing the
lower layers can prevent performance degradation. In particular, AlignFreeze
improves Part-of-Speech (PoS) tagging performance in languages where full
realignment fails: with XLM-R, it provides improvements of more than one
standard deviation in accuracy in seven more languages than full realignment.
comment: 24 pages, 2 figures, to be published in Proceedings of NAACL 2025
☆ Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text
Masked language modeling has become a widely adopted unsupervised technique
to pre-train language models. However, the process of selecting tokens for
masking is random, and the percentage of masked tokens is typically fixed for
the entire training process. In this paper, we propose to adjust the masking
ratio and to decide which tokens to mask based on a novel task-informed
anti-curriculum learning scheme. First, we harness task-specific knowledge
about useful and harmful tokens in order to determine which tokens to mask.
Second, we propose a cyclic decaying masking ratio, which corresponds to an
anti-curriculum schedule (from hard to easy). We exemplify our novel
task-informed anti-curriculum by masking (TIACBM) approach across three diverse
downstream tasks: sentiment analysis, text classification by topic, and
authorship attribution. Our findings suggest that TIACBM enhances the ability
of the model to focus on key task-relevant features, contributing to
statistically significant performance gains across tasks. We release our code
at https://github.com/JarcaAndrei/TIACBM.
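The cyclic decaying masking ratio can be sketched as follows; the exact functional form below (linear envelope, linear within-cycle decay) is our assumption, not the paper's schedule:

```python
def masking_ratio(step, total_steps, cycles=4, max_ratio=0.3, min_ratio=0.05):
    """Anti-curriculum schedule: the ratio restarts high at each cycle
    (hard) and decays within the cycle (easy), while the overall
    envelope also decays over the course of training."""
    progress = step / total_steps
    envelope = max_ratio - (max_ratio - min_ratio) * progress  # global decay
    cycle_pos = (progress * cycles) % 1.0                      # 0..1 per cycle
    return min_ratio + (envelope - min_ratio) * (1.0 - cycle_pos)
```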
☆ Every Expert Matters: Towards Effective Knowledge Distillation for Mixture-of-Experts Language Models
With the emergence of Mixture-of-Experts (MoE), the efficient scaling of
model size has accelerated the development of large language models in recent
years. However, their high memory requirements prevent their use in
resource-constrained environments. While knowledge distillation (KD) has been a
proven method for model compression, its application to MoE teacher models
remains underexplored. Through our investigation, we discover that
non-activated experts in MoE models possess valuable knowledge that benefits
student models. We further demonstrate that existing KD methods are not optimal
for compressing MoE models, as they fail to leverage this knowledge
effectively. To address this, we are the first to propose two intuitive
MoE-specific KD methods: Knowledge Augmentation (KA) and Student-Aware Router (SAR),
both designed to effectively extract knowledge from all experts. Specifically,
KA augments knowledge by sampling experts multiple times, while SAR uses all
experts and adjusts the expert weights through router training to provide
optimal knowledge. Extensive experiments show that our methods outperform
conventional KD methods, demonstrating their effectiveness for MoE teacher
models.
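The Knowledge Augmentation idea, sampling experts multiple times so that non-activated experts also contribute teacher signals, might look roughly like this toy sketch (the experts, router, and sampling scheme are stand-ins, not the paper's implementation):

```python
import random

def ka_teacher_outputs(experts, router_probs, x, num_samples=4, k=2):
    """Average the teacher's mixture output over several sampled expert
    subsets, so experts the router would not normally activate still
    contribute distillation targets."""
    outputs = []
    ids = list(range(len(experts)))
    for _ in range(num_samples):
        chosen = random.choices(ids, weights=router_probs, k=k)
        mix = sum(experts[i](x) for i in chosen) / k
        outputs.append(mix)
    return sum(outputs) / num_samples
```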
☆ LLMPopcorn: An Empirical Study of LLMs as Assistants for Popular Micro-video Generation
Popular micro-videos, dominant on platforms like TikTok and YouTube, hold
significant commercial value. The rise of high-quality AI-generated content has
spurred interest in AI-driven micro-video creation. However, despite the
advanced capabilities of large language models (LLMs) like ChatGPT and DeepSeek
in text generation and reasoning, their potential to assist the creation of
popular micro-videos remains largely unexplored.
In this paper, we conduct an empirical study on LLM-assisted popular
micro-video generation (LLMPopcorn). Specifically, we investigate the following
research questions: (i) How can LLMs be effectively utilized to assist popular
micro-video generation? (ii) To what extent can prompt-based enhancements
optimize the LLM-generated content for higher popularity? (iii) How well do
various LLMs and video generators perform in the popular micro-video generation
task? By exploring these questions, we show that advanced LLMs like DeepSeek-V3
enable micro-video generation to achieve popularity comparable to human-created
content. Prompt enhancements further boost popularity, and benchmarking
highlights DeepSeek-V3 and DeepSeek-R1 among LLMs, while LTX-Video and
HunyuanVideo lead in video generation. This pioneering work advances
AI-assisted micro-video creation, uncovering new research opportunities. We
will release the code and datasets to support future studies.
☆ Synthetic Data Generation for Culturally Nuanced Commonsense Reasoning in Low-Resource Languages
Quantifying reasoning capability in low-resource languages remains a
challenge in NLP due to data scarcity and limited access to annotators. While
LLM-assisted dataset construction has proven useful for medium- and
high-resource languages, its effectiveness in low-resource languages,
particularly for commonsense reasoning, is still unclear. In this paper, we
compare three dataset creation strategies: (1) LLM-assisted dataset generation,
(2) machine translation, and (3) human-written data by native speakers, to
build a culturally nuanced story comprehension dataset. We focus on Javanese
and Sundanese, two major local languages in Indonesia, and evaluate the
effectiveness of open-weight and closed-weight LLMs in assisting dataset
creation through extensive manual validation. To assess the utility of
synthetic data, we fine-tune language models on classification and generation
tasks using this data and evaluate performance on a human-written test set. Our
findings indicate that LLM-assisted data creation outperforms machine
translation.
comment: 18 pages total: 8 pages of main body, 6 pages of appendix. 4 figures
in main body, 6 figures in appendix. Submitted to ARR on February 2025
☆ Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options
We present a novel reasoning approach called Flow-of-Options (FoO), designed
to address intrinsic biases in Large Language Models (LLMs). FoO enables LLMs
to systematically explore a diverse range of possibilities in their reasoning,
as demonstrated by an FoO-based agentic system for autonomously solving Machine
Learning tasks (AutoML). Our framework outperforms state-of-the-art baselines,
achieving improvements of 38.2% - 69.2% on standard data science tasks, and
37.4% - 47.9% on therapeutic chemistry tasks. With an overall operation cost
under $1 per task, our framework is well-suited for cost-sensitive
applications. Beyond classification and regression, we illustrate the broader
applicability of our FoO-based agentic system to tasks such as reinforcement
learning and image generation. Our framework presents significant advancements
compared to current state-of-the-art agentic systems for AutoML, due to the
benefits of FoO in enforcing diversity in LLM solutions through compressed,
explainable representations that also support long-term memory when combined
with case-based reasoning.
comment: Github code: https://github.com/flagshippioneering/Flow-of-Options
☆ Finedeep: Mitigating Sparse Activation in Dense LLMs via Multi-Layer Fine-Grained Experts
Leiyu Pan, Zhenpeng Su, Minxuan Lv, Yizhe Xiong, Xiangwen Zhang, Zijia Lin, Hui Chen, Jungong Han, Guiguang Ding, Cheng Luo, Di Zhang, Kun Gai, Deyi Xiong
Large language models have demonstrated exceptional performance across a wide
range of tasks. However, dense models usually suffer from sparse activation,
where many activation values tend towards zero (i.e., they are inactivated). We
argue that this could restrict the efficient exploration of the model
representation space. To mitigate this issue, we propose Finedeep, a
deep-layered fine-grained expert architecture for dense models. Our framework
partitions the feed-forward neural network layers of traditional dense models
into small experts and arranges them across multiple sub-layers. A novel routing
mechanism is proposed to determine each expert's contribution. We conduct
extensive experiments across various model sizes, demonstrating that our
approach significantly outperforms traditional dense architectures in terms of
perplexity and benchmark performance while maintaining a comparable number of
parameters and floating-point operations. Moreover, we find that Finedeep
achieves optimal results when balancing depth and width, specifically by
adjusting the number of expert sub-layers and the number of experts per
sub-layer. Empirical results confirm that Finedeep effectively alleviates
sparse activation and efficiently utilizes representation capacity in dense
models.
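A minimal sketch of the partitioning-plus-routing idea, with scalars standing in for hidden vectors (the shapes, the residual wiring, and the scoring rule are illustrative assumptions, not the paper's architecture):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def finedeep_sublayer(x, experts, scorer):
    """One sub-layer: small experts (stand-ins for slices of a dense FFN)
    are combined with soft routing weights from a per-expert scorer."""
    weights = softmax([scorer(i, x) for i in range(len(experts))])
    return sum(w * f(x) for w, f in zip(weights, experts))

def finedeep_forward(x, sublayers, scorer):
    """Experts are spread across multiple sub-layers; each sub-layer adds
    its routed mixture back through a residual connection."""
    for experts in sublayers:
        x = x + finedeep_sublayer(x, experts, scorer)
    return x
```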
☆ SEFL: Harnessing Large Language Model Agents to Improve Educational Feedback Systems
Mike Zhang, Amalie Pernille Dilling, Léon Gondelman, Niels Erik Ruan Lyngdorf, Euan D. Lindsay, Johannes Bjerva
Providing high-quality feedback is crucial for student success but is
constrained by time, cost, and limited data availability. We introduce
Synthetic Educational Feedback Loops (SEFL), a novel framework designed to
deliver immediate, on-demand feedback at scale without relying on extensive,
real-world student data. In SEFL, two large language models (LLMs) operate in
teacher-student roles to simulate assignment completion and formative
feedback, generating abundant synthetic pairs of student work and corresponding
critiques. We then fine-tune smaller, more computationally efficient LLMs on
these synthetic pairs, enabling them to replicate key features of high-quality,
goal-oriented feedback. Unlike personalized tutoring approaches that offer
multi-turn, individualized instruction, SEFL specifically focuses on
replicating the teacher-to-student feedback loop for diverse assignments.
Through both LLM-as-a-judge and human evaluations, we demonstrate that
SEFL-tuned models outperform their non-tuned counterparts in feedback quality,
clarity, and timeliness. These findings reveal SEFL's potential to transform
feedback processes for higher education and beyond, offering an ethical and
scalable alternative to conventional manual feedback cycles.
☆ Conditioning LLMs to Generate Code-Switched Text: A Methodology Grounded in Naturally Occurring Data
Code-switching (CS) is still a critical challenge in Natural Language
Processing (NLP). Current Large Language Models (LLMs) struggle to interpret
and generate code-switched text, primarily due to the scarcity of large-scale
CS datasets for training. This paper presents a novel methodology to generate
CS data using LLMs, and test it on the English-Spanish language pair. We
propose back-translating natural CS sentences into monolingual English, and
using the resulting parallel corpus to fine-tune LLMs to turn monolingual
sentences into CS. Unlike previous approaches to CS generation, our methodology
uses natural CS data as a starting point, allowing models to learn its natural
distribution beyond grammatical patterns. We thoroughly analyse the models'
performance through a study on human preferences, a qualitative error analysis
and an evaluation with popular automatic metrics. Results show that our
methodology generates fluent code-switched text, expanding research
opportunities in CS communication, and that traditional metrics do not
correlate with human judgement when assessing the quality of the generated CS
data. We release our code and generated dataset under a CC-BY-NC-SA license.
☆ On-Device LLMs for Home Assistant: Dual Role in Intent Detection and Response Generation
This paper investigates whether Large Language Models (LLMs), fine-tuned on
synthetic but domain-representative data, can perform the twofold task of (i)
slot and intent detection and (ii) natural language response generation for a
smart home assistant, while running solely on resource-limited, CPU-only edge
hardware. We fine-tune LLMs to produce both JSON action calls and text
responses. Our experiments show that 16-bit and 8-bit quantized variants
preserve high accuracy on slot and intent detection and maintain strong
semantic coherence in generated text, whereas the 4-bit model, though retaining
generative fluency, suffers a noticeable drop in device-service classification
accuracy. Further evaluations on noisy human (non-synthetic) prompts and
out-of-domain intents confirm the models' generalization ability, obtaining
around 80-86% accuracy. While the average inference time is 5-6 seconds per
query -- acceptable for one-shot commands but suboptimal for multi-turn
dialogue -- our results affirm that an on-device LLM can effectively unify
command interpretation and flexible response generation for home automation
without relying on specialized hardware.
☆ Q-STRUM Debate: Query-Driven Contrastive Summarization for Recommendation Comparison
Query-driven recommendation with unknown items poses a challenge for users to
understand why certain items are appropriate for their needs. Query-driven
Contrastive Summarization (QCS) is a methodology designed to address this issue
by leveraging language-based item descriptions to clarify contrasts between
them. However, existing state-of-the-art contrastive summarization methods such
as STRUM-LLM fall short of this goal. To overcome these limitations, we
introduce Q-STRUM Debate, a novel extension of STRUM-LLM that employs
debate-style prompting to generate focused and contrastive summarizations of
item aspects relevant to a query. Leveraging modern large language models
(LLMs) as powerful tools for generating debates, Q-STRUM Debate provides
enhanced contrastive summaries. Experiments across three datasets demonstrate
that Q-STRUM Debate yields significant performance improvements over existing
methods on key contrastive summarization criteria, thus introducing a novel and
performant debate prompting methodology for QCS.
☆ GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning
Large Language Models (LLMs) fine-tuning technologies have achieved
remarkable results. However, traditional LLM fine-tuning approaches face
significant challenges: they require large Floating Point (FP) computation,
raising privacy concerns when handling sensitive data, and are impractical for
resource-constrained edge devices. While Parameter-Efficient Fine-Tuning (PEFT)
techniques reduce trainable parameters, their reliance on floating-point
arithmetic creates fundamental incompatibilities with edge hardware. In this
work, we introduce a novel framework for on-device LLM fine-tuning that
eliminates the need for floating-point operations in both inference and
training, named GSQ-Tuning. At its core is the Group-Shared Exponents Integer
format, which efficiently represents model parameters in integer format using
shared exponents among parameter groups. When combined with LoRA-like adapters,
this enables fully integer-based fine-tuning that is both memory and compute
efficient. We demonstrate that our approach achieves accuracy comparable to
FP16-based fine-tuning while significantly reducing memory usage (50%).
Moreover, compared to FP8, our method reduces power consumption by 5x and chip
area by 11x at the same performance, making large-scale model adaptation
feasible on edge devices.
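The core idea of group-shared exponents can be sketched in a few lines: each
group of parameters shares one power-of-two exponent and stores only integer
mantissas. This is a hypothetical toy version, not the paper's actual format
(which also covers training dynamics and LoRA-style adapters):

```python
import math

def quantize_group(values, bits=8):
    """Represent a group of floats as signed integers plus ONE shared
    power-of-two exponent (toy group-shared-exponents encoding)."""
    max_abs = max(abs(v) for v in values)
    exp = math.ceil(math.log2(max_abs)) if max_abs > 0 else 0
    scale = 2.0 ** (bits - 1 - exp)      # integer grid for this group
    qmax = 2 ** (bits - 1) - 1
    ints = [max(-qmax, min(qmax, round(v * scale))) for v in values]
    return ints, exp

def dequantize_group(ints, exp, bits=8):
    """Recover approximate floats from the shared-exponent integers."""
    scale = 2.0 ** (bits - 1 - exp)
    return [q / scale for q in ints]
```

Storing one exponent per group instead of per value is what removes the
per-element floating-point bookkeeping that edge hardware handles poorly.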
☆ Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation
Generating SQLs from user queries is a long-standing challenge, where the
accuracy of initial schema linking significantly impacts subsequent SQL
generation performance. However, current schema linking models still struggle
with missing relevant schema elements or an excess of redundant ones. A crucial
reason for this is that the commonly used metrics, recall and precision, fail
to capture the omission of relevant elements and thus cannot reflect actual
schema linking performance. Motivated by this, we propose an enhanced schema
linking metric that introduces a restricted missing indicator. Accordingly, we
introduce the Knapsack optimization-based Schema Linking Agent (KaSLA), a
plug-in schema linking agent designed to prevent the omission of relevant
schema elements while minimizing the inclusion of redundant ones. KaSLA
employs a hierarchical linking strategy that first identifies the optimal
table linking and subsequently links columns within the selected tables to
reduce the linking candidate space. In each linking step, it utilizes a
knapsack optimization approach to link potentially relevant elements while
tolerating a limited number of potentially redundant ones. With this
optimization, KaSLA-1.6B achieves superior schema linking results compared to
large-scale LLMs, including DeepSeek-V3 equipped with a state-of-the-art
(SOTA) schema linking method. Extensive experiments on Spider
and BIRD benchmarks verify that KaSLA can significantly improve the SQL
generation performance of SOTA text-to-SQL models by substituting their schema
linking processes.
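The knapsack view of schema linking can be illustrated with a toy 0/1 knapsack:
each candidate element carries a relevance score (value) and a redundancy cost
(weight), and we select the subset maximizing total relevance under a
redundancy budget. The function name, scores, and budget below are
illustrative stand-ins, not KaSLA's actual scoring model:

```python
def knapsack_link(elements, budget):
    """Select schema elements maximizing total relevance under a
    redundancy budget via 0/1 knapsack dynamic programming.

    elements: list of (name, relevance, cost) with small integer costs.
    """
    # dp[c] = (best_score, chosen_names) achievable with capacity c
    dp = [(0.0, [])] * (budget + 1)
    for name, rel, cost in elements:
        new_dp = dp[:]
        for c in range(cost, budget + 1):
            cand = (dp[c - cost][0] + rel, dp[c - cost][1] + [name])
            if cand[0] > new_dp[c][0]:
                new_dp[c] = cand
        dp = new_dp
    return max(dp, key=lambda t: t[0])[1]
```

Unlike a plain top-k cut on relevance, the budget explicitly bounds how much
redundancy the linker may admit while it chases relevant elements.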
☆ Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements
Shu Yang, Shenzhe Zhu, Zeyu Wu, Keyu Wang, Junchi Yao, Junchao Wu, Lijie Hu, Mengdi Li, Derek F. Wong, Di Wang
We introduce Fraud-R1, a benchmark designed to evaluate LLMs' ability to
defend against internet fraud and phishing in dynamic, real-world scenarios.
Fraud-R1 comprises 8,564 fraud cases sourced from phishing scams, fake job
postings, social media, and news, categorized into 5 major fraud types. Unlike
previous benchmarks, Fraud-R1 introduces a multi-round evaluation pipeline to
assess LLMs' resistance to fraud at different stages, including credibility
building, urgency creation, and emotional manipulation. Furthermore, we
evaluate 15 LLMs under two settings: 1. Helpful-Assistant, where the LLM
provides general decision-making assistance, and 2. Role-play, where the model
assumes a specific persona, widely used in real-world agent-based interactions.
Our evaluation reveals the significant challenges in defending against fraud
and phishing inducement, especially in role-play settings and fake job
postings. Additionally, we observe a substantial performance gap between
Chinese and English, underscoring the need for improved multilingual fraud
detection capabilities.
☆ Soundwave: Less is More for Speech-Text Alignment in LLMs
Existing end-to-end speech large language models (LLMs) usually rely on
large-scale annotated data for training, while data-efficient training has not
been discussed in depth. We focus on two fundamental problems between speech
and text: the representation space gap and sequence length inconsistency. We
propose Soundwave, which utilizes an efficient training strategy and a novel
architecture to address these issues. Results show that Soundwave outperforms
the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks,
using only one-fiftieth of the training data. Further analysis shows that
Soundwave still retains its intelligence during conversation. The project is
available at https://github.com/FreedomIntelligence/Soundwave.
☆ None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks
In LLM evaluations, reasoning is often distinguished from recall/memorization
by applying numerical variations to math-oriented questions. Here we
introduce a general variation method for multiple-choice questions that
completely dissociates the correct answer from previously seen tokens or
concepts, requiring LLMs to understand and reason (rather than memorizing) in
order to answer correctly. Using this method, we evaluate state-of-the-art
proprietary and open-source LLMs on two datasets available in English and
Spanish: the public MMLU benchmark and the private UNED-Access 2024 dataset.
Results show that all models experience remarkable accuracy drops under our
proposed variation, with an average loss of 57% on MMLU and 50% on UNED-Access
2024, ranging from 10% to 93% across models. Notably, the most accurate model
in our experimentation (OpenAI-o3-mini) is not the most robust
(DeepSeek-R1-70B), suggesting that the best models in standard evaluations may
not be the ones with better reasoning capabilities. Also, we see larger
accuracy drops in public (vs private) datasets and questions posed in their
original language (vs a manual translation), which are signs of contamination
and also point to a relevant role of recall/memorization in current LLMs'
answers.
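One variation consistent with the paper's title (the exact procedure is
detailed in the paper) can be sketched as replacing the correct option with
"None of the other answers", so that recalling the original answer string no
longer identifies the right choice:

```python
def none_of_the_others(question, options, correct_idx):
    """Rewrite a multiple-choice item so the correct option no longer
    contains any previously seen answer tokens: the model must verify
    that every remaining option is wrong to answer correctly."""
    new_options = list(options)
    new_options[correct_idx] = "None of the other answers"
    return question, new_options, correct_idx
```

A model that merely memorized "Paris" gains nothing here; it must reason that
the distractors are all incorrect.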
☆ Multilingual European Language Models: Benchmarking Approaches and Challenges
The breakthrough of generative large language models (LLMs) that can solve
different tasks through chat interaction has led to a significant increase in
the use of general benchmarks to assess the quality or performance of these
models beyond individual applications. There is also a need for better methods
to evaluate and compare models, given the ever-increasing number of new models
published. However, most of the established benchmarks revolve around
the English language. This paper analyses the benefits and limitations of
current evaluation datasets, focusing on multilingual European benchmarks. We
analyse seven multilingual benchmarks and identify four major challenges.
Furthermore, we discuss potential solutions to enhance translation quality and
mitigate cultural biases, including human-in-the-loop verification and
iterative translation ranking. Our analysis highlights the need for culturally
aware and rigorously validated benchmarks to assess the reasoning and
question-answering capabilities of multilingual LLMs accurately.
☆ H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking
Martin Kuo, Jianyi Zhang, Aolin Ding, Qinsi Wang, Louis DiValentin, Yujia Bao, Wei Wei, Da-Cheng Juan, Hai Li, Yiran Chen
Large Reasoning Models (LRMs) have recently extended their powerful reasoning
capabilities to safety checks, using chain-of-thought reasoning to decide
whether a request should be answered. While this new approach offers a
promising route for balancing model utility and safety, its robustness remains
underexplored. To address this gap, we introduce Malicious-Educator, a
benchmark that disguises extremely dangerous or malicious requests beneath
seemingly legitimate educational prompts. Our experiments reveal severe
security flaws in popular commercial-grade LRMs, including OpenAI o1/o3,
DeepSeek-R1, and Gemini 2.0 Flash Thinking. For instance, although OpenAI's o1
model initially maintains a high refusal rate of about 98%, subsequent model
updates significantly compromise its safety; and attackers can easily extract
criminal strategies from DeepSeek-R1 and Gemini 2.0 Flash Thinking without any
additional tricks. To further highlight these vulnerabilities, we propose
Hijacking Chain-of-Thought (H-CoT), a universal and transferable attack method
that leverages the model's own displayed intermediate reasoning to jailbreak
its safety reasoning mechanism. Under H-CoT, refusal rates sharply decline,
dropping from 98% to below 2%, and in some instances an initially cautious
tone even turns into one willing to provide harmful content.
We hope these findings underscore the urgent need for more robust safety
mechanisms to preserve the benefits of advanced reasoning capabilities without
compromising ethical standards.
☆ Are Multilingual Language Models an Off-ramp for Under-resourced Languages? Will we arrive at Digital Language Equality in Europe in 2030?
Large language models (LLMs) demonstrate unprecedented capabilities and
define the state of the art for almost all natural language processing (NLP)
tasks and also for essentially all Language Technology (LT) applications. LLMs
can only be trained for languages for which a sufficient amount of pre-training
data is available, effectively excluding many languages that are typically
characterised as under-resourced. However, there is both circumstantial and
empirical evidence that multilingual LLMs, which have been trained using data
sets that cover multiple languages (including under-resourced ones), do exhibit
strong capabilities for some of these under-resourced languages. This approach
may ultimately serve as a technological off-ramp for those under-resourced
languages for which "native" LLMs, and LLM-based technologies, cannot be
developed due to a lack of training data. This paper, which
concentrates on European languages, examines this idea, analyses the current
situation in terms of technology support and summarises related work. The
article concludes by focusing on the key open questions that need to be
answered for the approach to be put into practice in a systematic way.
☆ How desirable is alignment between LLMs and linguistically diverse human users?
We discuss how desirable it is that Large Language Models (LLMs) be able to
adapt or align their language behavior with users who may be diverse in their
language use. User diversity may arise, among other factors, from i) age
differences, ii) gender characteristics, and/or iii) multilingual experience,
and the associated differences in language processing and use. We consider
potential consequences for usability, communication, and LLM development.
☆ PAFT: Prompt-Agnostic Fine-Tuning
While Large Language Models (LLMs) adapt well to downstream tasks after
fine-tuning, this adaptability often compromises prompt robustness, as even
minor prompt variations can significantly degrade performance. To address this,
we propose Prompt-Agnostic Fine-Tuning(PAFT), a simple yet effective approach
that dynamically adjusts prompts during fine-tuning. This encourages the model
to learn underlying task principles rather than overfitting to specific prompt
formulations. PAFT operates in two stages: First, a diverse set of meaningful,
synthetic candidate prompts is constructed. Second, during fine-tuning, prompts
are randomly sampled from this set to create dynamic training inputs. Extensive
experiments across diverse datasets and LLMs demonstrate that models trained
with PAFT exhibit strong robustness and generalization across a wide range of
prompts, including unseen ones. This enhanced robustness improves both model
performance and inference speed while maintaining training efficiency. Ablation
studies further confirm the effectiveness of PAFT.
comment: 20 pages, 6 figures
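The second PAFT stage, sampling a prompt per training example, reduces to a
few lines. The templates below are hypothetical stand-ins; in PAFT the
candidate set is synthesized rather than hand-written:

```python
import random

# Hypothetical candidate templates; PAFT synthesizes a diverse set.
CANDIDATE_PROMPTS = [
    "Answer the question: {q}",
    "Q: {q}\nA:",
    "Please solve the following problem.\n{q}",
]

def paft_inputs(examples, seed=0):
    """Pair each training example with a randomly sampled prompt, so
    the model sees varied formulations of the same task while
    fine-tuning, instead of overfitting to one fixed template."""
    rng = random.Random(seed)
    return [rng.choice(CANDIDATE_PROMPTS).format(q=ex) for ex in examples]
```

Because the prompt varies while the underlying task stays fixed, gradient
updates reward task understanding rather than template matching.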
☆ Rejected Dialects: Biases Against African American Language in Reward Models NAACL
Preference alignment via reward models helps build safe, helpful, and
reliable large language models (LLMs). However, subjectivity in preference
judgments and the lack of representative sampling in preference data collection
can introduce new biases, hindering reward models' fairness and equity. In this
work, we introduce a framework for evaluating dialect biases in reward models
and conduct a case study on biases against African American Language (AAL)
through several experiments comparing reward model preferences and behavior on
paired White Mainstream English (WME) and both machine-translated and
human-written AAL corpora. We show that reward models are less aligned with
human preferences when processing AAL texts vs. WME ones (-4\% accuracy on
average), frequently disprefer AAL-aligned texts vs. WME-aligned ones, and
steer conversations toward WME, even when prompted with AAL texts. Our findings
provide a targeted analysis of anti-AAL biases at a relatively understudied
stage in LLM development, highlighting representational harms and ethical
questions about the desired behavior of LLMs concerning AAL.
comment: Accepted to NAACL Findings 2025
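The headline number, how often a reward model agrees with human preferences on
paired texts, reduces to a simple count. A sketch with hypothetical score
pairs (first element of each pair is the human-preferred response):

```python
def preference_accuracy(pairs):
    """Fraction of pairs where the reward model scores the
    human-preferred response above the rejected one.

    pairs: list of (score_chosen, score_rejected) reward scores.
    """
    return sum(c > r for c, r in pairs) / len(pairs)
```

Computing this separately over WME and AAL pairs exposes the kind of gap the
paper reports (about 4% lower accuracy on AAL texts on average).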
☆ Integrating Arithmetic Learning Improves Mathematical Reasoning in Smaller Models
While large models pre-trained on high-quality data exhibit excellent
performance across various reasoning tasks, including mathematical reasoning
(e.g. GSM8k, MultiArith), specializing smaller models to excel at mathematical
reasoning remains a challenging problem. Common approaches to address this
challenge include knowledge distillation, where smaller student models learn
from large pre-trained teacher models, and data augmentation, such as
rephrasing questions. Despite these efforts, smaller models struggle with
arithmetic computations, leading to errors in mathematical reasoning. In this
work, we focus on leveraging a programmatically generated arithmetic dataset to
enhance the reasoning capabilities of smaller models. We investigate two key
approaches to incorporate this dataset -- (1) intermediate fine-tuning, where a
model is fine-tuned on the arithmetic dataset before being trained on a
reasoning dataset, and (2) integrating the arithmetic dataset into the
instruction-tuning mixture, allowing the model to learn arithmetic skills
alongside general instruction-following abilities. Our experiments on multiple
reasoning benchmarks demonstrate that incorporating an arithmetic dataset,
whether through targeted fine-tuning or within the instruction-tuning mixture,
enhances the models' arithmetic capabilities, which in turn improves their
mathematical reasoning performance.
comment: Preprint
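The two integration strategies differ only in how the arithmetic examples are
ordered relative to the reasoning data. A minimal sketch (data-ordering only;
the actual training loop is omitted):

```python
import random

def build_training_stream(arith, reasoning, strategy, seed=0):
    """Two ways to add an arithmetic dataset to training:
    'intermediate': all arithmetic first, then the reasoning data;
    'mixture': shuffle both together into one instruction-tuning mix."""
    if strategy == "intermediate":
        return list(arith) + list(reasoning)
    mixed = list(arith) + list(reasoning)
    random.Random(seed).shuffle(mixed)
    return mixed
```

Intermediate fine-tuning front-loads arithmetic skill before reasoning data is
seen; the mixture interleaves both so neither skill is forgotten.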
☆ S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning
Recent studies have demonstrated the effectiveness of LLM test-time scaling.
However, existing approaches to incentivize LLMs' deep thinking abilities
generally require large-scale data or significant training efforts. Meanwhile,
it remains unclear how to improve the thinking abilities of less powerful base
models. In this work, we introduce S$^2$R, an efficient framework that enhances
LLM reasoning by teaching models to self-verify and self-correct during
inference. Specifically, we first initialize LLMs with iterative
self-verification and self-correction behaviors through supervised fine-tuning
on carefully curated data. The self-verification and self-correction skills are
then further strengthened by both outcome-level and process-level reinforcement
learning, with minimized resource requirements, enabling the model to
adaptively refine its reasoning process during inference. Our results
demonstrate that, with only 3.1k self-verifying and self-correcting behavior
initialization samples, Qwen2.5-math-7B achieves an accuracy improvement from
51.0\% to 81.6\%, outperforming models trained on an equivalent amount of
long-CoT distilled data. Extensive experiments and analysis based on three base
models across both in-domain and out-of-domain benchmarks validate the
effectiveness of S$^2$R. Our code and data are available at
https://github.com/NineAbyss/S2R.
☆ MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching
Existing multilingual vision-language (VL) benchmarks often only cover a
handful of languages. Consequently, evaluations of large vision-language models
(LVLMs) predominantly target high-resource languages, underscoring the need for
evaluation data for low-resource languages. To address this limitation, we
introduce MVL-SIB, a massively multilingual vision-language benchmark that
evaluates both cross-modal and text-only topical matching across 205 languages
-- over 100 more than the most multilingual existing VL benchmarks encompass.
We then benchmark a range of open-weight LVLMs together with GPT-4o(-mini)
on MVL-SIB. Our results reveal that LVLMs struggle in cross-modal topic
matching in lower-resource languages, performing no better than chance on
languages like N'Koo. Our analysis further reveals that VL support in LVLMs
declines disproportionately relative to textual support for lower-resource
languages, as evidenced by comparison of cross-modal and text-only topical
matching performance. We further observe that open-weight LVLMs do not benefit
from representing a topic with more than one image, suggesting that these
models are not yet fully effective at handling multi-image tasks. By
correlating performance on MVL-SIB with other multilingual VL benchmarks, we
highlight that MVL-SIB serves as a comprehensive probe of multilingual VL
understanding in LVLMs.
☆ MeMo: Towards Language Models with Associative Memory Mechanisms
Fabio Massimo Zanzotto, Elena Sofia Ruzzetti, Giancarlo A. Xompero, Leonardo Ranaldi, Davide Venditti, Federico Ranaldi, Cristina Giannone, Andrea Favalli, Raniero Romagnoli
Memorization is a fundamental ability of Transformer-based Large Language
Models, achieved through learning. In this paper, we propose a paradigm shift
by designing an architecture to memorize text directly, bearing in mind the
principle that memorization precedes learning. We introduce MeMo, a novel
architecture for language modeling that explicitly memorizes sequences of
tokens in layered associative memories. By design, MeMo offers transparency and
the possibility of model editing, including forgetting texts. We experimented
with the MeMo architecture, showing the memorization power of the one-layer and
the multi-layer configurations.
☆ Towards Equitable AI: Detecting Bias in Using Large Language Models for Marketing
The recent advances in large language models (LLMs) have revolutionized
industries such as finance, marketing, and customer service by enabling
sophisticated natural language processing tasks. However, the broad adoption of
LLMs brings significant challenges, particularly in the form of social biases
that can be embedded within their outputs. Biases related to gender, age, and
other sensitive attributes can lead to unfair treatment, raising ethical
concerns and risking both company reputation and customer trust. This study
examined bias in finance-related marketing slogans generated by LLMs (i.e.,
ChatGPT) by prompting tailored ads targeting five demographic categories:
gender, marital status, age, income level, and education level. A total of
1,700 slogans were generated for 17 unique demographic groups, and key terms
were categorized into four thematic groups: empowerment, financial, benefits
and features, and personalization. Bias was systematically assessed using
relative bias calculations and statistically tested with the Kolmogorov-Smirnov
(KS) test against general slogans generated for any individual. Results
revealed that marketing slogans are not neutral; rather, they emphasize
different themes based on demographic factors. Women, younger individuals,
low-income earners, and those with lower education levels receive more distinct
messaging compared to older, higher-income, and highly educated individuals.
This underscores the need to consider demographic-based biases in AI-generated
marketing strategies and their broader societal implications. The findings of
this study provide a roadmap for developing more equitable AI systems,
highlighting the need for ongoing bias detection and mitigation efforts in
LLMs.
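The study's statistical machinery, relative bias plus a two-sample
Kolmogorov-Smirnov test, can be sketched with stdlib code (a toy KS statistic
on numeric samples; the paper applies the test to term-frequency
distributions):

```python
import bisect

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the two samples' empirical CDFs."""
    a, b = sorted(a), sorted(b)
    def ecdf(s, x):                      # fraction of sample <= x
        return bisect.bisect_right(s, x) / len(s)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in set(a) | set(b))

def relative_bias(group_freq, baseline_freq):
    """Relative over/under-representation of a theme for a demographic
    group versus the baseline slogans."""
    return (group_freq - baseline_freq) / baseline_freq
```

A KS statistic near 0 means a group's theme distribution matches the general
slogans; values near 1 indicate the distinct messaging the study reports.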
☆ An LLM-Powered Agent for Physiological Data Analysis: A Case Study on PPG-based Heart Rate Estimation
Large language models (LLMs) are revolutionizing healthcare by improving
diagnosis, patient care, and decision support through interactive
communication. More recently, they have been applied to analyzing physiological
time-series like wearable data for health insight extraction. Existing methods
embed raw numerical sequences directly into prompts, which can exceed token
limits and increase computational costs. Additionally, some studies integrated
features extracted from time-series in textual prompts or applied multimodal
approaches. However, these methods often produce generic and unreliable outputs
due to LLMs' limited analytical rigor and inefficiency in interpreting
continuous waveforms. In this paper, we develop an LLM-powered agent for
physiological time-series analysis, aimed at bridging the gap in integrating
LLMs with well-established analytical tools. Built on OpenCHA, an open-source
LLM-powered framework, our agent features an orchestrator that integrates user
interaction, data sources, and analytical tools to generate accurate health
insights. To evaluate its effectiveness, we implement a case study on heart
rate (HR) estimation from Photoplethysmogram (PPG) signals using a dataset of
PPG and Electrocardiogram (ECG) recordings in a remote health monitoring study.
The agent's performance is benchmarked against OpenAI GPT-4o-mini and GPT-4o,
with ECG serving as the gold standard for HR estimation. Results demonstrate
that our agent significantly outperforms benchmark models by achieving lower
error rates and more reliable HR estimations. The agent implementation is
publicly available on GitHub.
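The kind of "well-established analytical tool" the agent can call is
illustrated by a toy peak-counting HR estimator (a deliberately simplified
stand-in; real pipelines filter the signal and enforce refractory periods):

```python
import math

def estimate_hr(ppg, fs):
    """Estimate heart rate (BPM) by counting peaks in a PPG window.
    A peak is a sample above both neighbours and above the mean."""
    mean = sum(ppg) / len(ppg)
    peaks = [i for i in range(1, len(ppg) - 1)
             if ppg[i - 1] < ppg[i] > ppg[i + 1] and ppg[i] > mean]
    return 60.0 * len(peaks) * fs / len(ppg)

# Synthetic 1.2 Hz (72 BPM) pulse wave, 10 s sampled at 50 Hz.
fs = 50
ppg = [math.sin(2 * math.pi * 1.2 * i / fs) for i in range(fs * 10)]
```

Delegating such numeric steps to deterministic tools, rather than asking the
LLM to read raw waveforms, is what drives the reported error reduction.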
☆ Subword models struggle with word learning, but surprisal hides it
We study word learning in subword and character language models with the
psycholinguistic lexical decision task. While subword LMs struggle to discern
words and non-words with high accuracy, character LMs solve this task easily
and consistently. Furthermore, when comparing word learning and syntactic
learning, the two processes are separable in character LMs, where word
learning predates syntactic learning, whereas they are simultaneous in subword
LMs. This raises questions about the adequacy of subword LMs for
modeling language acquisition and positions character LMs as a viable
alternative.
comment: 12 pages
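The lexical decision setup can be sketched with a tiny character-bigram model
standing in for a trained character LM (a hypothetical stand-in, not the
paper's models): score strings by length-normalized log-probability and expect
words to outscore non-words.

```python
import math
from collections import defaultdict

def train_char_lm(lexicon):
    """Character-bigram LM with add-one smoothing over a toy lexicon;
    returns a length-normalized log-probability scorer usable for the
    lexical decision task (word vs. non-word)."""
    counts = defaultdict(lambda: defaultdict(int))
    chars = {"^", "$"}                   # word boundary markers
    for w in lexicon:
        s = "^" + w + "$"
        chars.update(s)
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    V = len(chars)
    def score(w):
        s = "^" + w + "$"
        lp = sum(math.log((counts[a][b] + 1) / (sum(counts[a].values()) + V))
                 for a, b in zip(s, s[1:]))
        return lp / (len(s) - 1)
    return score
```

A subword tokenizer would instead see opaque multi-character units, which is
one intuition for why subword LMs discriminate words from non-words less well.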
☆ KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan
Mukhammed Togmanov, Nurdaulet Mukhituly, Diana Turmakhan, Jonibek Mansurov, Maiya Goloburda, Akhmed Sakip, Zhuohan Xie, Yuxia Wang, Bekassyl Syzdykov, Nurkhan Laiyk, Alham Fikri Aji, Ekaterina Kochmar, Preslav Nakov, Fajri Koto
Despite having a population of twenty million, Kazakhstan's culture and
language remain underrepresented in the field of natural language processing.
Although large language models (LLMs) continue to advance worldwide, progress
in the Kazakh language has been limited, as seen in the scarcity of dedicated
models and benchmark evaluations. To address this gap, we introduce KazMMLU,
the first MMLU-style dataset specifically designed for the Kazakh language.
KazMMLU
comprises 23,000 questions that cover various educational levels, including
STEM, humanities, and social sciences, sourced from authentic educational
materials and manually validated by native speakers and educators. The dataset
includes 10,969 Kazakh questions and 12,031 Russian questions, reflecting
Kazakhstan's bilingual education system and rich local context. Our evaluation
of several state-of-the-art multilingual models (Llama-3.1, Qwen-2.5, GPT-4,
and DeepSeek V3) demonstrates substantial room for improvement, as even the
best-performing models struggle to achieve competitive performance in Kazakh
and Russian. These findings underscore significant performance gaps compared to
high-resource languages. We hope that our dataset will enable further research
and development of Kazakh-centric LLMs. Data and code will be made available
upon acceptance.
☆ Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models
When encountering increasingly frequent performance improvements or cost
reductions from a new large language model (LLM), developers of applications
leveraging LLMs must decide whether to take advantage of these improvements or
stay with older tried-and-tested models. Low perceived switching frictions can
lead to choices that do not consider more subtle behavior changes that the
transition may induce. Our experiments use a popular game-theoretic behavioral
economics model of trust to show stark differences in the trusting behavior of
OpenAI's and DeepSeek's models. We highlight a collapse in the economic trust
behavior of the o1-mini and o3-mini models as they reconcile profit-maximizing
and risk-seeking with future returns from trust, and contrast it with
DeepSeek's more sophisticated and profitable trusting behavior that stems from
an ability to incorporate deeper concepts like forward planning and
theory-of-mind. As LLMs form the basis for high-stakes commercial systems, our
results highlight the perils of relying on LLM performance benchmarks that are
too narrowly defined and suggest that careful analysis of their hidden fault
lines should be part of any organization's AI strategy.
☆ Pitfalls of Scale: Investigating the Inverse Task of Redefinition in Large Language Models
Inverse tasks can uncover potential reasoning gaps as Large Language Models
(LLMs) scale up. In this work, we explore the redefinition task, in which we
assign alternative values to well-known physical constants and units of
measure, prompting LLMs to respond accordingly. Our findings show that not only
does model performance degrade with scale, but its false confidence also rises.
Moreover, while factors such as prompting strategies or response formatting are
influential, they do not preclude LLMs from anchoring to memorized values.
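A redefinition prompt is easy to construct; the wording below is illustrative,
not the paper's exact template:

```python
def redefinition_prompt(constant, usual, new, question):
    """Build a redefinition-task prompt: ask the model to adopt an
    alternative value for a well-known constant before answering.
    An anchored model falls back to the memorized standard value."""
    return (f"For this task, assume {constant} equals {new} instead of "
            f"its usual value {usual}.\n{question}")
```

Anchoring shows up when the model answers with the standard value despite the
explicit redefinition, and the paper finds this worsens with scale.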
☆ Simulating User Diversity in Task-Oriented Dialogue Systems using Large Language Models
In this study, we explore the application of Large Language Models (LLMs) for
generating synthetic users and simulating user conversations with a
task-oriented dialogue system and present detailed results and their analysis.
We propose a comprehensive novel approach to user simulation that uses LLMs to
create diverse user profiles, set goals, engage in multi-turn
dialogues, and evaluate the conversation success. We employ two proprietary
LLMs, namely GPT-4o and GPT-o1 (Achiam et al., 2023), to generate a
heterogeneous base of user profiles, characterized by varied demographics,
multiple user goals, different conversational styles, initial knowledge levels,
interests, and conversational objectives. We perform a detailed analysis of the
user profiles generated by LLMs to assess the diversity, consistency, and
potential biases inherent in these LLM-generated user simulations. We find that
GPT-o1 generates a more heterogeneous user distribution across most user
attributes, while GPT-4o produces more skewed attribute distributions. The
generated set of user profiles is then used to simulate dialogue sessions by
interacting with a task-oriented dialogue system.
☆ Towards Text-Image Interleaved Retrieval
Xin Zhang, Ziqi Dai, Yongqi Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Jun Yu, Wenjie Li, Min Zhang
Current multimodal information retrieval studies mainly focus on single-image
inputs, which limits real-world applications involving multiple images and
text-image interleaved content. In this work, we introduce the text-image
interleaved retrieval (TIIR) task, where the query and document are interleaved
text-image sequences, and the model is required to understand the semantics
from the interleaved context for effective retrieval. We construct a TIIR
benchmark based on naturally interleaved wikiHow tutorials, where a specific
pipeline is designed to generate interleaved queries. To explore the task, we
adapt several off-the-shelf retrievers and build a dense baseline based on an
interleaved multimodal large language model (MLLM). We then propose a novel
Matryoshka Multimodal Embedder (MME), which compresses the number of visual
tokens at different granularity, to address the challenge of excessive visual
tokens in MLLM-based TIIR models. Experiments demonstrate that simple
adaptation of existing models does not consistently yield effective results.
Our MME achieves significant improvements over the baseline while using
substantially fewer visual tokens. We provide extensive analysis and will
release the dataset and
code to facilitate future research.
comment: 16 pages, 14 figures
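One reading of "compresses the number of visual tokens at different
granularity" is Matryoshka-style pooling: the same token sequence can be
reduced to many, few, or one summary vector. A hypothetical mean-pooling
sketch (not MME's learned embedder):

```python
def pool_tokens(tokens, k):
    """Mean-pool a sequence of token vectors down to k tokens.
    k = len(tokens) keeps full detail; k = 1 collapses the sequence
    to a single summary vector."""
    n, dim = len(tokens), len(tokens[0])
    out = []
    for j in range(k):
        lo, hi = j * n // k, (j + 1) * n // k   # contiguous chunk
        chunk = tokens[lo:hi]
        out.append([sum(v[d] for v in chunk) / len(chunk)
                    for d in range(dim)])
    return out
```

Nesting granularities this way lets one model trade retrieval fidelity against
the per-image token budget without retraining.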
☆ Commonsense Reasoning in Arab Culture
Abdelrahman Sadallah, Junior Cedric Tonga, Khalid Almubarak, Saeed Almheiri, Farah Atif, Chatrine Qwaider, Karima Kadaoui, Sara Shatnawi, Yaser Alesh, Fajri Koto
Despite progress in Arabic large language models, such as Jais and AceGPT,
their evaluation on commonsense reasoning has largely relied on
machine-translated datasets, which lack cultural depth and may introduce
Anglocentric biases. Commonsense reasoning is shaped by geographical and
cultural contexts, and existing English datasets fail to capture the diversity
of the Arab world. To address this, we introduce \datasetname, a commonsense
reasoning dataset in Modern Standard Arabic (MSA), covering cultures of 13
countries across the Gulf, Levant, North Africa, and the Nile Valley. The
dataset was built from scratch by engaging native speakers to write and
validate culturally relevant questions for their respective countries.
\datasetname spans 12 daily life domains with 54 fine-grained subtopics,
reflecting various aspects of social norms, traditions, and everyday
experiences. Zero-shot evaluations show that open-weight language models with
up to 32B parameters struggle to comprehend diverse Arab cultures, with
performance varying across regions. These findings highlight the need for more
culturally aware models and datasets tailored to the Arabic-speaking world.
☆ Mind the Gap: Aligning the Brain with Language Models Requires a Nonlinear and Multimodal Approach
Self-supervised language and audio models effectively predict brain responses
to speech. However, traditional prediction models rely on linear mappings from
unimodal features, despite the complex integration of auditory signals with
linguistic and semantic information across widespread brain networks during
speech comprehension. Here, we introduce a nonlinear, multimodal prediction
model that combines audio and linguistic features from pre-trained models
(e.g., LLAMA, Whisper). Our approach achieves a 17.2% and 17.9% improvement in
prediction performance (unnormalized and normalized correlation) over
traditional unimodal linear models, as well as a 7.7% and 14.4% improvement,
respectively, over prior state-of-the-art models. These improvements represent
a major step towards future robust in-silico testing and improved decoding
performance. They also reveal how auditory and semantic information are fused
in motor, somatosensory, and higher-level semantic regions, aligning with
existing neurolinguistic theories. Overall, our work highlights the often
neglected potential of nonlinear and multimodal approaches to brain modeling,
paving the way for future studies to embrace these strategies in naturalistic
neurolinguistics research.
☆ How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild
In the age of misinformation, hallucination -- the tendency of Large Language
Models (LLMs) to generate non-factual or unfaithful responses -- represents the
main risk for their global utility. Despite LLMs becoming increasingly
multilingual, the vast majority of research on detecting and quantifying LLM
hallucination is (a) English-centric and (b) focused on machine translation (MT)
and summarization, tasks that are less common ``in the wild'' than open
information seeking. In contrast, we aim to quantify the extent of LLM
hallucination across languages in knowledge-intensive long-form question
answering. To this end, we train a multilingual hallucination detection model
and conduct a large-scale study across 30 languages and 6 open-source LLM
families. We start from an English hallucination detection dataset and rely on
MT to generate (noisy) training data in other languages. We also manually
annotate gold data for five high-resource languages; we then demonstrate, for
these languages, that the estimates of hallucination rates are similar between
silver (LLM-generated) and gold test sets, validating the use of silver data
for estimating hallucination rates for other languages. For the final rates
estimation, we build a knowledge-intensive QA dataset for 30 languages with
LLM-generated prompts and Wikipedia articles as references. We find that, while
LLMs generate longer responses with more hallucinated tokens for
higher-resource languages, there is no correlation between length-normalized
hallucination rates of languages and their digital representation. Further, we
find that smaller LLMs exhibit larger hallucination rates than larger models.
comment: Under Review
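The length-normalized hallucination rate discussed above can be sketched in a few lines; the micro-averaged aggregation over responses is an illustrative assumption, not necessarily the paper's exact protocol:

```python
def length_normalized_rate(responses):
    """Hallucination rate normalized by response length.

    Each response is a pair (num_hallucinated_tokens, num_total_tokens).
    Micro-averaging over all responses ensures that languages with longer
    responses are not penalized for length alone.
    """
    hallucinated = sum(h for h, _ in responses)
    total = sum(t for _, t in responses)
    return hallucinated / total if total else 0.0
```

This makes the paper's finding concrete: a language can have longer responses with more hallucinated tokens in absolute terms while its length-normalized rate stays comparable to other languages.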
☆ R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs
Recent studies have combined Large Language Models (LLMs) with Knowledge
Graphs (KGs) to enhance reasoning, improving inference accuracy without
additional training while mitigating hallucination. However, existing
frameworks are often rigid, struggling to adapt to KG or task changes. They
also rely heavily on powerful LLMs for reliable (i.e., trustworthy) reasoning.
To address this, we introduce R2-KG, a plug-and-play, dual-agent framework that
separates reasoning into two roles: an Operator (a low-capacity LLM) that
gathers evidence and a Supervisor (a high-capacity LLM) that makes final
judgments. This design is cost-efficient for LLM inference while still
maintaining strong reasoning accuracy. Additionally, R2-KG employs an
Abstention mechanism, generating answers only when sufficient evidence is
collected from KG, which significantly enhances reliability. Experiments across
multiple KG-based reasoning tasks show that R2-KG consistently outperforms
baselines in both accuracy and reliability, regardless of the inherent
capability of LLMs used as the Operator. Further experiments reveal that the
single-agent version of R2-KG, equipped with a strict self-consistency
strategy, achieves significantly higher-than-baseline reliability while
reducing inference cost. However, it also leads to a higher abstention rate in
complex KGs. Our findings establish R2-KG as a flexible and cost-effective
solution for KG-based reasoning. It reduces reliance on high-capacity LLMs
while ensuring trustworthy inference.
☆ Efficient Machine Translation Corpus Generation: Integrating Human-in-the-Loop Post-Editing with Large Language Models
This paper introduces an advanced methodology for machine translation (MT)
corpus generation, integrating semi-automated, human-in-the-loop post-editing
with large language models (LLMs) to enhance efficiency and translation
quality. Building upon previous work that utilized real-time training of a
custom MT quality estimation metric, this system incorporates novel LLM
features such as Enhanced Translation Synthesis and Assisted Annotation
Analysis, which improve initial translation hypotheses and quality assessments,
respectively. Additionally, the system employs LLM-Driven Pseudo Labeling and a
Translation Recommendation System to reduce human annotator workload in
specific contexts. These improvements not only retain the original benefits of
cost reduction and enhanced post-edit quality but also open new avenues for
leveraging cutting-edge LLM advancements. The project's source code is
available for community use, promoting collaborative developments in the field.
The demo video can be accessed here.
☆ MediaMind: Revolutionizing Media Monitoring using Agentification
In an era of rapid technological advancements, agentification of software
tools has emerged as a critical innovation, enabling systems to function
autonomously and adaptively. This paper introduces MediaMind as a case study to
demonstrate the agentification process, highlighting how existing software can
be transformed into intelligent agents capable of independent decision-making
and dynamic interaction. Developed by aiXplain, MediaMind leverages agent-based
architecture to autonomously monitor, analyze, and provide insights from
multilingual media content in real time. The focus of this paper is on the
technical methodologies and design principles behind agentifying MediaMind,
showcasing how agentification enhances adaptability, efficiency, and
responsiveness. Through detailed case studies and practical examples, we
illustrate how the agentification of MediaMind empowers organizations to
streamline workflows, optimize decision-making, and respond to evolving trends.
This work underscores the broader potential of agentification to revolutionize
software tools across various domains.
☆ Self-Enhanced Reasoning Training: Activating Latent Reasoning in Small Models for Enhanced Reasoning Distillation ICASSP 2025
Yong Zhang, Bingyuan Zhang, Zhitao Li, Ming Li, Ning Cheng, Minchuan Chen, Tao Wei, Jun Ma, Shaojun Wang, Jing Xiao
The rapid advancement of large language models (LLMs) has significantly
enhanced their reasoning abilities, enabling increasingly complex tasks.
However, these capabilities often diminish in smaller, more computationally
efficient models like GPT-2. Recent research shows that reasoning distillation
can help small models acquire reasoning capabilities, but most existing methods
focus primarily on improving teacher-generated reasoning paths. Our
observations reveal that small models can generate high-quality reasoning paths
during sampling, even without chain-of-thought prompting, though these paths
are often latent due to their low probability under standard decoding
strategies. To address this, we propose Self-Enhanced Reasoning Training
(SERT), which activates and leverages latent reasoning capabilities in small
models through self-training on filtered, self-generated reasoning paths under
zero-shot conditions. Experiments using OpenAI's GPT-3.5 as the teacher model
and GPT-2 models as the student models demonstrate that SERT enhances the
reasoning abilities of small models, improving their performance in reasoning
distillation.
comment: Accepted by the 50th IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP 2025)
☆ "I know myself better, but not really greatly": Using LLMs to Detect and Explain LLM-Generated Texts
Jiazhou Ji, Jie Guo, Weidong Qiu, Zheng Huang, Yang Xu, Xinru Lu, Xiaoyu Jiang, Ruizhe Li, Shujun Li
Large language models (LLMs) have demonstrated impressive capabilities in
generating human-like texts, but the potential misuse of such LLM-generated
texts raises the need to distinguish between human-generated and LLM-generated
content. This paper explores the detection and explanation capabilities of
LLM-based detectors of LLM-generated texts, in the context of a binary
classification task (human-generated texts vs LLM-generated texts) and a
ternary classification task (human-generated texts, LLM-generated texts, and
undecided). By evaluating on six closed/open-source LLMs with different sizes,
our findings reveal that while self-detection consistently outperforms
cross-detection, i.e., LLMs can detect texts generated by themselves more
accurately than those generated by other LLMs, the performance of
self-detection is still far from ideal, indicating that further improvements
are needed. We also show that extending the binary to the ternary
classification task with a new class "Undecided" can enhance both detection
accuracy and explanation quality, with improvements being statistically
significant and consistent across all LLMs. We finally conducted comprehensive
qualitative and quantitative analyses on the explanation errors, which are
categorized into three types: reliance on inaccurate features (the most
frequent error), hallucinations, and incorrect reasoning. These findings with
our human-annotated dataset emphasize the need for further research into
improving both self-detection and self-explanation, particularly to address
overfitting issues that may hinder generalization.
comment: Under review
☆ Beyond Seen Data: Improving KBQA Generalization Through Schema-Guided Logical Form Generation
Knowledge base question answering (KBQA) aims to answer user questions in
natural language using rich human knowledge stored in large KBs. As current
KBQA methods struggle with unseen knowledge base elements at test time, we
introduce SG-KBQA: a novel model that injects schema contexts into entity
retrieval and logical form generation to tackle this issue. It uses the richer
semantics and awareness of the knowledge base structure provided by schema
contexts to enhance generalizability. We show that SG-KBQA achieves strong
generalizability, outperforming state-of-the-art models on two commonly used
benchmark datasets across a variety of test settings. Code will be released
upon paper publication.
comment: 17 pages
☆ Iron Sharpens Iron: Defending Against Attacks in Machine-Generated Text Detection with Adversarial Training ACL 2025
Machine-generated Text (MGT) detection is crucial for regulating and
attributing online texts. While the existing MGT detectors achieve strong
performance, they remain vulnerable to simple perturbations and adversarial
attacks. To build an effective defense against malicious perturbations, we view
MGT detection from a threat modeling perspective, that is, analyzing the
model's vulnerability from an adversary's point of view and exploring effective
mitigations. To this end, we introduce an adversarial framework for training a
robust MGT detector, named GREedy Adversary PromoTed DefendER (GREATER). The
GREATER consists of two key components: an adversary GREATER-A and a detector
GREATER-D. The GREATER-D learns to defend against the adversarial attack from
GREATER-A and generalizes the defense to other attacks. GREATER-A identifies
and perturbs the critical tokens in embedding space, along with greedy search
and pruning to generate stealthy and disruptive adversarial examples. Besides,
we update the GREATER-A and GREATER-D synchronously, encouraging the GREATER-D
to generalize its defense to different attacks and varying attack intensities.
Our experimental results across 9 text perturbation strategies and 5
adversarial attacks show that our GREATER-D reduces the Attack Success Rate
(ASR) by 10.61% compared with SOTA defense methods while our GREATER-A is
demonstrated to be more effective and efficient than SOTA attack approaches.
comment: Submitted to ACL 2025, Preprint, Under review
☆ Playing with Voices: Tabletop Role-Playing Game Recordings as a Diarization Challenge NAACL
This paper provides a proof of concept that audio of tabletop role-playing
games (TTRPG) could serve as a challenge for diarization systems. TTRPGs are
carried out mostly by conversation. Participants often alter their voices to
indicate that they are talking as a fictional character. Audio processing
systems are susceptible to voice conversion, whether or not it is
technologically assisted. TTRPGs present a conversational phenomenon in which
voice conversion is an inherent characteristic of an immersive gaming
experience. This can make it harder for diarizers to identify the true speaker
and to recognize an impersonation as such. We present the creation of a small TTRPG audio
dataset and compare it against the AMI and the ICSI corpus. The performance of
two diarizers, pyannote.audio and wespeaker, was evaluated. We observed that
TTRPGs' properties result in a higher confusion rate for both diarizers.
Additionally, wespeaker strongly underestimates the number of speakers in the
TTRPG audio files. We propose TTRPG audio as a promising challenge for
diarization systems.
comment: 15 pages, 14 figures, published in NAACL Findings 2025
☆ Translate Smart, not Hard: Cascaded Translation Systems with Quality-Aware Deferral
Larger models often outperform smaller ones but come with high computational
costs. Cascading offers a potential solution. By default, it uses smaller
models and defers only some instances to larger, more powerful models. However,
designing effective deferral rules remains a challenge. In this paper, we
propose a simple yet effective approach for machine translation, using existing
quality estimation (QE) metrics as deferral rules. We show that QE-based
deferral allows a cascaded system to match the performance of a larger model
while invoking it for a small fraction (30% to 50%) of the examples,
significantly reducing computational costs. We validate this approach through
both automatic and human evaluation.
comment: Preprint
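The QE-based deferral rule above can be sketched in a few lines; the model and QE callables and the threshold value are illustrative assumptions, not the paper's specific metrics:

```python
def cascaded_translate(source, small_model, large_model, qe_score, threshold=0.8):
    """Translate with the small model by default; defer to the large model
    only when the quality-estimation metric flags the draft as low quality.

    small_model / large_model: callables str -> str (translation systems)
    qe_score: callable (source, hypothesis) -> float in [0, 1]
    Returns the chosen translation and which model produced it.
    """
    draft = small_model(source)
    if qe_score(source, draft) >= threshold:
        return draft, "small"
    return large_model(source), "large"
```

The threshold directly controls the deferral fraction (the 30% to 50% reported above), trading translation quality against the cost of invoking the larger model.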
☆ Multi-Novelty: Improve the Diversity and Novelty of Contents Generated by Large Language Models via inference-time Multi-Views Brainstorming
Large Language Models (LLMs) demonstrate remarkable proficiency in generating
accurate and fluent text. However, they often struggle with diversity and
novelty, leading to repetitive or overly deterministic responses. These
limitations stem from constraints in training data, including gaps in specific
knowledge domains, outdated information, and an over-reliance on textual
sources. Such shortcomings reduce their effectiveness in tasks requiring
creativity, multi-perspective reasoning, and exploratory thinking, such as
LLM-based AI scientist agents and creative artist agents. To address this
challenge, we introduce an inference-time multi-view brainstorming method,
which we refer to as "Multi-Novelty": a novel approach that enriches input
prompts with diverse perspectives derived from both textual and visual sources.
By incorporating additional contextual information as diverse starting points for
chain of thoughts, this method enhances the variety and creativity of generated
outputs. Importantly, our approach is model-agnostic, requiring no
architectural modifications and being compatible with both open-source and
proprietary LLMs.
☆ Theoretical Guarantees for Minimum Bayes Risk Decoding
Minimum Bayes Risk (MBR) decoding optimizes output selection by maximizing
the expected utility value of an underlying human distribution. While prior
work has shown the effectiveness of MBR decoding through empirical evaluation,
few studies have analytically investigated why the method is effective. As a
result of our analysis, we show that, given the size $n$ of the reference
hypothesis set used in computation, MBR decoding approaches the optimal
solution with high probability at a rate of $O\left(n^{-\frac{1}{2}}\right)$,
under certain assumptions, even though the language space $Y$ is significantly
larger ($Y\gg n$). This result helps to theoretically explain the strong
performance observed in several prior empirical studies on MBR decoding. In
addition, we provide the performance gap for maximum-a-posteriori (MAP)
decoding and compare it to MBR decoding. The result of this paper indicates
that MBR decoding tends to converge to the optimal solution faster than MAP
decoding in several cases.
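The MBR decision rule analyzed above can be sketched as follows; the unigram-overlap utility is a toy stand-in for the utility functions used in practice (e.g., BLEU or neural metrics), and the reference set plays the role of the $n$ sampled hypotheses in the paper's rate analysis:

```python
def unigram_overlap(h, r):
    """Toy utility: Jaccard overlap of unigram sets."""
    hs, rs = set(h.split()), set(r.split())
    return len(hs & rs) / max(len(hs | rs), 1)

def mbr_decode(hypotheses, references, utility):
    """Select the hypothesis maximizing expected utility against a
    Monte Carlo reference set sampled from the model distribution."""
    def expected_utility(h):
        return sum(utility(h, r) for r in references) / len(references)
    return max(hypotheses, key=expected_utility)
```

As the reference set grows, this empirical expectation concentrates around the true expected utility, which is the mechanism behind the $O(n^{-1/2})$ rate stated above.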
☆ Multi-Step Alignment as Markov Games: An Optimistic Online Gradient Descent Approach with Convergence Guarantees NeurIPS
Reinforcement Learning from Human Feedback (RLHF) has been highly successful
in aligning large language models with human preferences. While prevalent
methods like DPO have demonstrated strong performance, they frame interactions
with the language model as a bandit problem, which limits their applicability
in real-world scenarios where multi-turn conversations are common.
Additionally, DPO relies on the Bradley-Terry model assumption, which does not
adequately capture the non-transitive nature of human preferences. In this
paper, we address these challenges by modeling the alignment problem as a
two-player constant-sum Markov game, where each player seeks to maximize their
winning rate against the other across all steps of the conversation. Our
approach Multi-step Preference Optimization (MPO) is built upon the natural
actor-critic framework~\citep{peters2008natural}. We further develop OMPO based
on the optimistic online gradient descent
algorithm~\citep{rakhlin2013online,joulani17a}. Theoretically, we provide a
rigorous analysis for both algorithms on convergence and show that OMPO
requires $\mathcal{O}(\epsilon^{-1})$ policy updates to converge to an
$\epsilon$-approximate Nash equilibrium. We also validate the effectiveness of
our method on multi-turn conversations dataset and math reasoning dataset.
comment: Accepted as oral presentation in NeurIPS LanGame Workshop, revised
from ICLR submission
☆ Speech-FT: A Fine-tuning Strategy for Enhancing Speech Representation Models Without Compromising Generalization Ability
Speech representation models are highly effective at extracting general
features for various tasks. While fine-tuning can enhance these representations
for specific applications, it often compromises their generalization ability.
To address this challenge, we propose Speech-FT, a fine-tuning strategy for
speech representation models that leverages model merging to preserve
generalization ability while still benefiting from fine-tuning. Speech-FT is
effective across different fine-tuning scenarios and is compatible with various
types of speech representation models, providing a versatile solution.
Speech-FT offers an efficient and practical approach to further improving
general speech representations after pre-training.
☆ Baichuan-M1: Pushing the Medical Capability of Large Language Models
Bingning Wang, Haizhou Zhao, Huozhi Zhou, Liang Song, Mingyu Xu, Wei Cheng, Xiangrong Zeng, Yupeng Zhang, Yuqi Huo, Zecheng Wang, Zhengyun Zhao, Da Pan, Fan Yang, Fei Kou, Fei Li, Fuzhong Chen, Guosheng Dong, Han Liu, Hongda Zhang, Jin He, Jinjie Yang, Kangxi Wu, Kegeng Wu, Lei Su, Linlin Niu, Linzhuang Sun, Mang Wang, Pengcheng Fan, Qianli Shen, Rihui Xin, Shunya Dang, Songchi Zhou, Weipeng Chen, Wenjing Luo, Xin Chen, Xin Men, Xionghai Lin, Xuezhen Dong, Yan Zhang, Yifei Duan, Yuyan Zhou, Zhi Ma, Zhiying Wu
The current generation of large language models (LLMs) is typically designed
for broad, general-purpose applications, while domain-specific LLMs, especially
in vertical fields like medicine, remain relatively scarce. In particular, the
development of highly efficient and practical LLMs for the medical domain is
challenging due to the complexity of medical knowledge and the limited
availability of high-quality data. To bridge this gap, we introduce
Baichuan-M1, a series of large language models specifically optimized for
medical applications. Unlike traditional approaches that simply continue
pretraining on existing models or apply post-training to a general base model,
Baichuan-M1 is trained from scratch with a dedicated focus on enhancing medical
capabilities. Our model is trained on 20 trillion tokens and incorporates a
range of effective training methods that strike a balance between general
capabilities and medical expertise. As a result, Baichuan-M1 not only performs
strongly across general domains such as mathematics and coding but also excels
in specialized medical fields. We have open-sourced Baichuan-M1-14B, a mini
version of our model, which can be accessed through the following links.
comment: 33 pages, technical report
☆ Evaluation of Best-of-N Sampling Strategies for Language Model Alignment
Best-of-N (BoN) sampling with a reward model has been shown to be an
effective strategy for aligning Large Language Models (LLMs) with human
preferences at the time of decoding. BoN sampling is susceptible to a problem
known as reward hacking. Since the reward model is an imperfect proxy for the
true objective, an excessive focus on optimizing its value can lead to a
compromise of its performance on the true objective. Previous work proposes
Regularized BoN sampling (RBoN), a variant of BoN sampling that adds a
regularization term to the objective, and shows empirically that it mitigates
reward hacking and outperforms plain BoN sampling (Jinnai et al., 2024).
However, Jinnai et al. (2024) introduce RBoN heuristically, without analyzing
why such a regularization strategy improves the performance of BoN sampling.
The aim of this study is to analyze the effect of regularization strategies on
BoN sampling. Using these regularization strategies corresponds to robust
optimization, which maximizes the worst case over a set of possible
perturbations in the proxy reward. Although the theoretical guarantees do not
apply directly to RBoN, RBoN can be viewed as a practical implementation of
this idea.
This paper proposes an extension of the RBoN framework, called Stochastic RBoN
sampling (SRBoN), which is a theoretically guaranteed approach to worst-case
RBoN in proxy reward. We then perform an empirical evaluation using the
AlpacaFarm and Anthropic's hh-rlhf datasets to evaluate which factors of the
regularization strategies contribute to improving the true reward. In
addition, we propose another simple RBoN method, Sentence Length Regularized
BoN, which outperforms the previous methods in our experiments.
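Plain BoN and a regularized variant can be sketched as follows; the word-count penalty is a hypothetical stand-in for the sentence-length regularizer discussed above, whose exact formulation may differ:

```python
def word_count(text):
    # hypothetical penalty: number of whitespace-separated tokens
    return len(text.split())

def best_of_n(candidates, reward, regularizer=None, lam=0.0):
    """Best-of-N selection: pick the candidate maximizing the proxy
    reward, optionally minus a weighted regularization penalty.

    With lam = 0 this is vanilla BoN; a nonzero lam discourages the
    selector from exploiting reward-model weaknesses (reward hacking).
    """
    def score(c):
        penalty = regularizer(c) if regularizer else 0.0
        return reward(c) - lam * penalty
    return max(candidates, key=score)
```

The regularizer changes which candidate wins only when the proxy-reward gap between candidates is small, which is exactly the regime where reward hacking is most likely.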
☆ A$^2$ATS: Retrieval-Based KV Cache Reduction via Windowed Rotary Position Embedding and Query-Aware Vector Quantization
Junhui He, Junna Xing, Nan Wang, Rui Xu, Shangyu Wu, Peng Zhou, Qiang Liu, Chun Jason Xue, Qingan Li
Long context large language models (LLMs) pose significant challenges for
efficient serving due to the large memory footprint and high access overhead of
KV cache. Retrieval-based KV cache reduction methods can mitigate these
challenges, typically by offloading the complete KV cache to CPU and retrieving
necessary tokens on demand during inference. However, these methods still
suffer from unsatisfactory accuracy degradation and extra retrieval overhead.
To address these limitations, this paper proposes A$^2$ATS, a novel
retrieval-based KV cache reduction method. A$^2$ATS aims to obtain an accurate
approximation of attention scores by applying the vector quantization technique
to key states, thereby enabling efficient and precise retrieval of the top-K
tokens. First, we propose Windowed Rotary Position Embedding, which decouples
the positional dependency from query and key states after position embedding.
Then, we propose query-aware vector quantization that optimizes the objective
of attention score approximation directly. Finally, we design the heterogeneous
inference architecture for KV cache offloading, enabling long context serving
with larger batch sizes. Experimental results demonstrate that A$^2$ATS can
achieve a lower performance degradation with similar or lower overhead compared
to existing methods, thereby increasing long context serving throughput by up
to $2.7 \times$.
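The retrieval idea behind A$^2$ATS can be sketched as follows; the nearest-centroid assignment here is ordinary vector quantization, not the paper's query-aware objective or Windowed Rotary Position Embedding:

```python
import numpy as np

def approx_topk_tokens(query, keys, codebook, k):
    """Approximate top-k attention retrieval via vector quantization.

    keys: (n, d) cached key states; codebook: (c, d) centroids with c << n.
    Each key is replaced by its nearest centroid, so the score q . key is
    approximated by q . centroid: only c dot products plus a table lookup,
    instead of n dot products against the full KV cache.
    """
    # quantization step: assign each key to its nearest centroid
    d2 = ((keys[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)                  # (n,)
    # score the query against centroids once, then gather per key
    centroid_scores = codebook @ query          # (c,)
    approx_scores = centroid_scores[assign]     # (n,)
    return np.argsort(-approx_scores)[:k]
```

The retrieved indices determine which tokens' KV entries are fetched from CPU memory, which is where the serving-throughput gains come from.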
☆ Demystifying Multilingual Chain-of-Thought in Process Reward Modeling
Large language models (LLMs) are designed to perform a wide range of tasks.
To improve their ability to solve complex problems requiring multi-step
reasoning, recent research leverages process reward modeling to provide
fine-grained feedback at each step of the reasoning process for reinforcement
learning (RL), but it predominantly focuses on English. In this paper, we
tackle the critical challenge of extending process reward models (PRMs) to
multilingual settings. To achieve this, we train multilingual PRMs on a dataset
spanning seven languages, which is translated from English. Through
comprehensive evaluations on two widely used reasoning benchmarks across 11
languages, we demonstrate that multilingual PRMs not only improve average
accuracy but also reduce early-stage reasoning errors. Furthermore, our results
highlight the sensitivity of multilingual PRMs to both the number of training
languages and the volume of English data, while also uncovering the benefits
arising from more candidate responses and trainable parameters. This work opens
promising avenues for robust multilingual applications in complex, multi-step
reasoning tasks. In addition, we release the code to foster research along this
line.
☆ R.R.: Unveiling LLM Training Privacy through Recollection and Ranking
Large Language Models (LLMs) pose significant privacy risks, potentially
leaking training data due to implicit memorization. Existing privacy attacks
primarily focus on membership inference attacks (MIAs) or data extraction
attacks, but reconstructing specific personally identifiable information (PII)
in LLM's training data remains challenging. In this paper, we propose R.R.
(Recollect and Rank), a novel two-step privacy stealing attack that enables
attackers to reconstruct PII entities from scrubbed training data where the PII
entities have been masked. In the first stage, we introduce a prompt paradigm
named recollection, which instructs the LLM to repeat a masked text while
filling in the masks. Then we use PII identifiers to extract recollected PII
candidates.
In the second stage, we design a new criterion to score each PII candidate and
rank them. Motivated by membership inference, we leverage the reference model
as a calibration to our criterion. Experiments across three popular PII
datasets demonstrate that R.R. achieves better PII reconstruction performance
than the baselines. These results highlight the vulnerability of LLMs to PII
leakage even when training data has been scrubbed. We release the replication
package of R.R. at a link.
comment: 13 pages, 9 figures
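The reference-model calibration in the ranking stage can be sketched as a membership-inference-style score; the specific criterion here (a target-minus-reference log-likelihood gap) is an assumption for illustration, not necessarily the paper's exact scoring rule:

```python
def rank_pii_candidates(candidates, target_logp, ref_logp):
    """Rank recollected PII candidates by a calibrated score.

    target_logp / ref_logp: callables str -> float, giving the target
    model's and reference model's log-likelihood of a candidate in
    context. A candidate that the target model finds far more likely
    than the reference model does is more plausibly memorized from the
    target's training data.
    """
    return sorted(candidates,
                  key=lambda c: target_logp(c) - ref_logp(c),
                  reverse=True)
```

The reference model serves the same role as the calibration model in membership inference attacks: it discounts candidates that any model would find likely for purely linguistic reasons.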
☆ \textit{One Size doesn't Fit All}: A Personalized Conversational Tutoring Agent for Mathematics Instruction
Large language models (LLMs) have been increasingly employed in various
intelligent educational systems, simulating human tutors to facilitate
effective human-machine interaction. However, previous studies often overlook
the significance of recognizing and adapting to individual learner
characteristics. Such adaptation is crucial for enhancing student engagement
and learning efficiency, particularly in mathematics instruction, where diverse
learning styles require personalized strategies to promote comprehension and
enthusiasm. In this paper, we propose a \textbf{P}erson\textbf{A}lized
\textbf{C}onversational tutoring ag\textbf{E}nt (PACE) for mathematics
instruction. PACE simulates students' learning styles based on the Felder and
Silverman learning style model, aligning with each student's persona. In this
way, our PACE can effectively assess the personality of students, allowing it to
develop individualized teaching strategies that resonate with their unique
learning styles. To further enhance students' comprehension, PACE employs the
Socratic teaching method to provide instant feedback and encourage deep
thinking. By constructing personalized teaching data and training models, PACE
demonstrates the ability to identify and adapt to the unique needs of each
student, significantly improving the overall learning experience and outcomes.
Moreover, we establish multi-aspect evaluation criteria and conduct extensive
analysis to assess the performance of personalized teaching. Experimental
results demonstrate the superiority of our model in personalizing the
educational experience and motivating students compared to existing methods.
♻ ☆ Scaling Test-Time Compute Without Verification or RL is Suboptimal
Despite substantial advances in scaling test-time compute, an ongoing debate
in the community is how it should be scaled up to enable continued and
efficient improvements with scaling. There are largely two approaches: first,
distilling successful search or thinking traces; and second, using verification
(e.g., 0/1 outcome rewards, reward models, or verifiers) to guide reinforcement
learning (RL) and search algorithms. In this paper, we prove that finetuning
LLMs with verifier-based (VB) methods based on RL or search is far superior to
verifier-free (VF) approaches based on distilling or cloning search traces,
given a fixed amount of compute/data budget. Further, we show that as we scale
test-time compute (measured as the output token length) and training data,
suboptimality of VF methods scales poorly compared to VB when the base
pre-trained LLM presents a heterogeneous distribution over correct solution
traces (e.g., different lengths, styles, etc.) and admits a non-sharp
distribution over rewards on traces sampled from it. We formalize this
condition using anti-concentration [Erd\H{o}s, 1945]. This implies a stronger
result that VB methods scale better asymptotically, with the performance gap
between VB and VF methods widening as test-time budget grows. We corroborate
our theory empirically on both didactic and math reasoning problems with
3/8/32B-sized pre-trained LLMs, where we find verification is crucial for
scaling test-time compute.
♻ ☆ Multilingual Retrieval Augmented Generation for Culturally-Sensitive Tasks: A Benchmark for Cross-lingual Robustness
Bryan Li, Fiona Luo, Samar Haider, Adwait Agashe, Tammy Li, Runqi Liu, Muqing Miao, Shriya Ramakrishnan, Yuan Yuan, Chris Callison-Burch
The paradigm of retrieval-augmented generation (RAG) helps mitigate
hallucinations of large language models (LLMs). However, RAG also introduces
biases contained within the retrieved documents. These biases can be amplified
in scenarios which are multilingual and culturally-sensitive, such as
territorial disputes. In this paper, we introduce BordIRLines, a benchmark
consisting of 720 territorial dispute queries paired with 14k Wikipedia
documents across 49 languages. To evaluate LLMs' cross-lingual robustness for
this task, we formalize several modes for multilingual retrieval. Our
experiments on several LLMs reveal that retrieving multilingual documents best
improves response consistency and decreases geopolitical bias over using purely
in-language documents, showing how incorporating diverse perspectives improves
robustness. Also, querying in low-resource languages displays a much wider
variance in the linguistic distribution of response citations. Our further
experiments and case studies investigate how cross-lingual RAG is affected by
aspects from IR to document contents. We release our benchmark and code to
support further research towards ensuring equitable information access across
languages at https://huggingface.co/datasets/borderlines/bordirlines.
♻ ☆ Can a Single Model Master Both Multi-turn Conversations and Tool Use? CoALM: A Unified Conversational Agentic Language Model
Emre Can Acikgoz, Jeremiah Greer, Akul Datta, Ze Yang, William Zeng, Oussama Elachqar, Emmanouil Koukoumidis, Dilek Hakkani-Tür, Gokhan Tur
Large Language Models (LLMs) with API-calling capabilities enabled building
effective Language Agents (LA), while also revolutionizing the conventional
task-oriented dialogue (TOD) paradigm. However, current approaches face a
critical dilemma: TOD systems are often trained on a limited set of target
APIs, requiring new data to maintain their quality when interfacing with new
services, while LAs are not trained to maintain user intent over multi-turn
conversations. Because both robust multi-turn management and advanced function
calling are crucial for effective conversational agents, we evaluate these
skills on three popular benchmarks: MultiWOZ 2.4 (TOD), BFCL V3 (LA), and
API-Bank (LA), and our analyses reveal that specialized approaches excel in one
domain but underperform in the other. To bridge this chasm, we introduce CoALM
(Conversational Agentic Language Model), a unified approach that integrates
both conversational and agentic capabilities. We created CoALM-IT, a carefully
constructed multi-task dataset that interleaves multi-turn ReAct reasoning with
complex API usage. Using CoALM-IT, we train three models: CoALM 8B, CoALM 70B,
and CoALM 405B, which outperform top domain-specific models, including GPT-4o,
across all three benchmarks. This demonstrates the feasibility of a single-model
approach for both TOD and LA, setting a new standard for conversational agents.
♻ ☆ Do Large Multimodal Models Solve Caption Generation for Scientific Figures? Lessons Learned from SciCap Challenge 2023 ACL 2025
Ting-Yao E. Hsu, Yi-Li Hsu, Shaurya Rohatgi, Chieh-Yang Huang, Ho Yin Sam Ng, Ryan Rossi, Sungchul Kim, Tong Yu, Lun-Wei Ku, C. Lee Giles, Ting-Hao K. Huang
Since the SciCap dataset's launch in 2021, the research community has made
significant progress in generating captions for scientific figures in scholarly
articles. In 2023, the first SciCap Challenge took place, inviting global teams
to use an expanded SciCap dataset to develop models for captioning diverse
figure types across various academic fields. At the same time, text generation
models advanced quickly, with many powerful pre-trained large multimodal models
(LMMs) emerging that showed impressive capabilities in various
vision-and-language tasks. This paper presents an overview of the first SciCap
Challenge and details the performance of various models on its data, capturing
a snapshot of the field's state. We found that professional editors
overwhelmingly preferred figure captions generated by GPT-4V over those from
all other models and even the original captions written by authors. Following
this key finding, we conducted detailed analyses to answer this question: Have
advanced LMMs solved the task of generating captions for scientific figures?
comment: Accepted to TACL 2025
♻ ☆ Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection
Jailbreaking techniques trick Large Language Models (LLMs) into producing
restricted outputs, posing a serious threat. One line of defense is to use
another LLM as a Judge to evaluate the harmfulness of generated text. However,
we reveal that these Judge LLMs are vulnerable to token segmentation bias, an
issue that arises when delimiters alter the tokenization process, splitting
words into smaller sub-tokens. This disrupts the embeddings of the entire
sequence, reducing detection accuracy and allowing harmful content to be
misclassified as safe. In this paper, we introduce Emoji Attack, a novel
strategy that amplifies existing jailbreak prompts by exploiting token
segmentation bias. Our method leverages in-context learning to systematically
insert emojis into text before it is evaluated by a Judge LLM, inducing
embedding distortions that significantly lower the likelihood of detecting
unsafe content. Unlike traditional delimiters, emojis also introduce semantic
ambiguity, making them particularly effective in this attack. Through
experiments on state-of-the-art Judge LLMs, we demonstrate that Emoji Attack
substantially reduces the "unsafe" prediction rate, bypassing existing
safeguards.
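The insertion mechanism can be illustrated with a toy sketch. The emoji choice, insertion interval, and function name below are illustrative assumptions, not the paper's exact attack strategy; the point is only that in-word delimiters force a tokenizer to segment a word into unfamiliar sub-tokens.

```python
# Toy sketch of in-word emoji insertion (the emoji, interval, and function
# name are illustrative assumptions, not the paper's exact attack strategy).

def insert_emoji(text: str, emoji: str = "\U0001F600", every: int = 3) -> str:
    """Insert `emoji` after every `every` characters inside each word,
    so a tokenizer can no longer segment the word into its usual sub-tokens."""
    out_words = []
    for word in text.split(" "):
        chunks = [word[i:i + every] for i in range(0, len(word), every)]
        out_words.append(emoji.join(chunks))
    return " ".join(out_words)
```

Because the perturbed word still reads naturally to a human (and an emoji carries its own semantics), the disruption is harder to filter than plain delimiter characters.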
♻ ☆ Media Slant is Contagious
This paper examines the diffusion of media slant. We document the influence
of Fox News Channel (FNC) on the partisan slant of local newspapers in the U.S.
over the years 1995-2008. We measure the political slant of local newspapers by
scaling the news article texts to Republicans' and Democrats' speeches in
Congress. Using channel positioning as an instrument for viewership, we find
that higher FNC viewership causes local newspapers to adopt more right-wing
slant. The effect emerges gradually, only several years after FNC's
introduction, mirroring the channel's growing influence on voting behavior. A
main driver of the shift in newspaper slant appears to be a change in local
political preferences.
♻ ☆ Explainable Fake News Detection With Large Language Model via Defense Among Competing Wisdom WWW'2024
Most fake news detection methods learn latent feature representations based
on neural networks, which makes them black boxes to classify a piece of news
without giving any justification. Existing explainable systems generate
veracity justifications from investigative journalism, which suffer from
delayed debunking and low efficiency. Recent studies simply assume that the
justification is equivalent to the majority opinions expressed in the wisdom of
crowds. However, the opinions typically contain some inaccurate or biased
information since the wisdom of crowds is uncensored. To detect fake news from
a sea of diverse, crowded and even competing narratives, in this paper, we
propose a novel defense-based explainable fake news detection framework.
Specifically, we first propose an evidence extraction module that splits the
wisdom of crowds into two competing parties and detects salient evidence on
each side. To distill concise insights from this evidence, we then design a
prompt-based module that utilizes a large language model to generate
justifications by inferring reasons towards two possible veracities. Finally,
we propose a defense-based inference module to determine veracity via modeling
the defense among these justifications. Extensive experiments conducted on two
real-world benchmarks demonstrate that our proposed method outperforms
state-of-the-art baselines in terms of fake news detection and provides
high-quality justifications.
comment: 12 pages, WWW'2024
♻ ☆ DeepSeek-V3 Technical Report
DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J. L. Cai, Jian Liang, Jianzhong Guo, Jiaqi Ni, Jiashi Li, Jiawei Wang, Jin Chen, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, Junxiao Song, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Xu, Leyi Xia, Liang Zhao, Litong Wang, Liyue Zhang, Meng Li, Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Ning Tian, Panpan Huang, Peiyi Wang, Peng Zhang, Qiancheng Wang, Qihao Zhu, Qinyu Chen, Qiushi Du, R. J. Chen, R. L. Jin, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runxin Xu, Ruoyu Zhang, Ruyi Chen, S. S. Li, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaoqing Wu, Shengfeng Ye, Shengfeng Ye, Shirong Ma, Shiyu Wang, Shuang Zhou, Shuiping Yu, Shunfeng Zhou, Shuting Pan, T. Wang, Tao Yun, Tian Pei, Tianyu Sun, W. L. Xiao, Wangding Zeng, Wanjia Zhao, Wei An, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, X. Q. Li, Xiangyue Jin, Xianzu Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaojin Shen, Xiaokang Chen, Xiaokang Zhang, Xiaosha Chen, Xiaotao Nie, Xiaowen Sun, Xiaoxiang Wang, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xingkai Yu, Xinnan Song, Xinxia Shan, Xinyi Zhou, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, Y. K. Li, Y. Q. Wang, Y. X. Wei, Y. X. 
Zhu, Yang Zhang, Yanhong Xu, Yanhong Xu, Yanping Huang, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Li, Yaohui Wang, Yi Yu, Yi Zheng, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Ying Tang, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yu Wu, Yuan Ou, Yuchen Zhu, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yukun Zha, Yunfan Xiong, Yunxian Ma, Yuting Yan, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Z. F. Wu, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhen Huang, Zhen Zhang, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhibin Gou, Zhicheng Ma, Zhigang Yan, Zhihong Shao, Zhipeng Xu, Zhiyu Wu, Zhongyu Zhang, Zhuoshu Li, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Ziyi Gao, Zizheng Pan
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with
671B total parameters with 37B activated for each token. To achieve efficient
inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent
Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated
in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free
strategy for load balancing and sets a multi-token prediction training
objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion
diverse and high-quality tokens, followed by Supervised Fine-Tuning and
Reinforcement Learning stages to fully harness its capabilities. Comprehensive
evaluations reveal that DeepSeek-V3 outperforms other open-source models and
achieves performance comparable to leading closed-source models. Despite its
excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its
full training. In addition, its training process is remarkably stable.
Throughout the entire training process, we did not experience any irrecoverable
loss spikes or perform any rollbacks. The model checkpoints are available at
https://github.com/deepseek-ai/DeepSeek-V3.
♻ ☆ CausalGraph2LLM: Evaluating LLMs for Causal Queries NAACL'25
Causality is essential in scientific research, enabling researchers to
interpret true relationships between variables. These causal relationships are
often represented by causal graphs, which are directed acyclic graphs. With the
recent advancements in Large Language Models (LLMs), there is an increasing
interest in exploring their capabilities in causal reasoning and their
potential use to hypothesize causal graphs. These tasks necessitate the LLMs to
encode the causal graph effectively for subsequent downstream tasks. In this
paper, we introduce CausalGraph2LLM, a comprehensive benchmark comprising over
700k queries across diverse causal graph settings to evaluate the causal
reasoning capabilities of LLMs. We categorize the causal queries into two
types: graph-level and node-level queries. We benchmark both open-source and
proprietary models in our study. Our findings reveal that while LLMs show
promise in this domain, they are highly sensitive to the encoding used. Even
capable models like GPT-4 and Gemini-1.5 exhibit sensitivity to encoding, with
deviations of about $60\%$. We further demonstrate this sensitivity for
downstream causal intervention tasks. Moreover, we observe that LLMs can often
display biases when presented with contextual information about a causal graph,
potentially stemming from their parametric memory.
comment: NAACL'25 Findings, Code - https://github.com/ivaxi0s/CausalGraph2LLM
♻ ☆ Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization
Audrey Huang, Wenhao Zhan, Tengyang Xie, Jason D. Lee, Wen Sun, Akshay Krishnamurthy, Dylan J. Foster
Language model alignment methods such as reinforcement learning from human
feedback (RLHF) have led to impressive advances in language model capabilities,
but are limited by a widely observed phenomenon known as overoptimization,
where the quality of the language model degrades over the course of the
alignment process. As the model optimizes performance with respect to an
offline reward model, it overfits to inaccuracies and drifts away from
preferred responses covered by the data. To discourage such distribution shift,
KL-regularization is widely employed in existing offline alignment methods, but
overoptimization continues to harm performance. Lending theoretical insight
into the source of these empirical observations, we first show that
KL-regularization is too weak to prevent overfitting, and then raise the following
question: is it possible to design an efficient algorithm that is provably
robust to overoptimization?
We address this question with a new algorithm for offline alignment,
$\chi^2$-Preference Optimization ($\chi$PO). $\chi$PO is a one-line change to
Direct Preference Optimization (DPO; Rafailov et al., 2023), which only
involves modifying the logarithmic link function in the DPO objective. Despite
this minimal change, $\chi$PO implicitly implements the principle of pessimism
in the face of uncertainty via regularization with the $\chi^2$-divergence --
which quantifies uncertainty more effectively than KL-regularization -- and
provably alleviates overoptimization, achieving sample-complexity guarantees
based on single-policy concentrability -- the gold standard in offline
reinforcement learning. $\chi$PO's simplicity and strong guarantees make it the
first practical and general-purpose offline alignment algorithm that is
provably robust to overoptimization.
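The "one-line change" can be sketched concretely. This is a minimal illustration under the assumption that the modified link is φ(z) = z + log z applied to the policy/reference ratio, as described in the paper; the function names and the β default are ours.

```python
import math

def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(lr_w: float, lr_l: float, beta: float = 0.1) -> float:
    # lr_* are implicit-reward log-ratios log(pi_theta / pi_ref) for the
    # chosen (w) and rejected (l) responses.
    return -math.log(_sigmoid(beta * (lr_w - lr_l)))

def chipo_loss(lr_w: float, lr_l: float, beta: float = 0.1) -> float:
    # The one-line change: replace the log link log(z) with
    # phi(z) = z + log(z), evaluated on z = exp(lr), i.e. exp(lr) + lr.
    link = lambda lr: math.exp(lr) + lr
    return -math.log(_sigmoid(beta * (link(lr_w) - link(lr_l))))
```

Because exp(lr) grows far faster than lr, the mixed link penalizes large deviations from the reference policy much more heavily than the pure log link, which is where the implicit pessimism against overoptimization comes from.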
♻ ☆ Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words ICLR2025
Sparse autoencoders (SAEs) have gained a lot of attention as a promising tool
to improve the interpretability of large language models (LLMs) by mapping the
complex superposition of polysemantic neurons into monosemantic features and
composing a sparse dictionary of words. However, traditional performance
metrics like Mean Squared Error and L0 sparsity ignore the evaluation of the
semantic representational power of SAEs -- whether they can acquire
interpretable monosemantic features while preserving the semantic relationship
of words. For instance, it is not obvious whether a learned sparse feature
could distinguish different meanings in one word. In this paper, we propose a
suite of evaluations for SAEs to analyze the quality of monosemantic features
by focusing on polysemous words. Our findings reveal that SAEs developed to
improve the MSE-L0 Pareto frontier do not necessarily enhance the extraction
of monosemantic features and may even confound interpretability. Analyzing
SAEs with polysemous words also sheds light on the internal mechanisms of
LLMs: deeper layers and the Attention module contribute to distinguishing
polysemy within a word. Our semantics-focused evaluation offers new insights
into polysemy and the existing SAE objective, and contributes to the
development of more practical SAEs.
comment: Published at ICLR2025
♻ ☆ Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu
Large language models (LLMs) such as OpenAI's o1 have demonstrated remarkable
abilities in complex reasoning tasks by scaling test-time compute and
exhibiting human-like deep thinking. However, we identify a phenomenon we term
underthinking, where o1-like LLMs frequently switch between different reasoning
thoughts without sufficiently exploring promising paths to reach a correct
solution. This behavior leads to inadequate depth of reasoning and decreased
performance, particularly on challenging mathematical problems. To
systematically analyze this issue, we conduct experiments on three challenging
test sets and two representative open-source o1-like models, revealing that
frequent thought switching correlates with incorrect responses. We introduce a
novel metric to quantify underthinking by measuring token efficiency in
incorrect answers. To address underthinking, we propose a decoding strategy
with a thought-switching penalty (TIP) that discourages premature transitions
between thoughts, encouraging deeper exploration of each reasoning path.
Experimental results demonstrate that our approach improves accuracy across
challenging datasets without requiring model fine-tuning. Our findings
contribute to understanding reasoning inefficiencies in o1-like LLMs and offer
a practical solution to enhance their problem-solving capabilities.
comment: 1. We have updated the results for DeepSeek-R1, and all of our
original conclusions remain valid. 2. Our proposed TIP approach remains
effective in Best-of-N scenarios (e.g., self-consistency and Laconic
Decoding) when built on DeepSeek-R1
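A thought-switching penalty of this kind can be sketched as a simple logit adjustment at decode time. The marker set and penalty magnitude below are assumptions for illustration; the paper's exact token list and schedule may differ.

```python
# Illustrative sketch of a thought-switching penalty at decode time.
# The marker set and penalty magnitude are illustrative assumptions;
# the paper's exact token list and schedule may differ.

SWITCH_MARKERS = {"Alternatively", "Wait", "However"}  # hypothetical markers

def penalize_switch_logits(logits: dict, switch_tokens: set = SWITCH_MARKERS,
                           penalty: float = 3.0) -> dict:
    """Subtract `penalty` from the logit of any token that would start a
    new line of thought, making premature switches less likely to be sampled."""
    return {tok: score - penalty if tok in switch_tokens else score
            for tok, score in logits.items()}
```

Since the adjustment happens purely at sampling time, it requires no fine-tuning, consistent with the paper's claim of training-free gains.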
♻ ☆ Towards Human Understanding of Paraphrase Types in Large Language Models
Paraphrases represent a human's intuitive ability to understand expressions
presented in various different ways. Current paraphrase evaluations of language
models primarily use binary approaches, offering limited interpretability of
specific text changes. Atomic paraphrase types (APT) decompose paraphrases into
different linguistic changes and offer a granular view of the flexibility in
linguistic expression (e.g., a shift in syntax or vocabulary used). In this
study, we assess the human preferences towards ChatGPT in generating English
paraphrases with ten APTs and five prompting techniques. We introduce APTY
(Atomic Paraphrase TYpes), a dataset of 800 sentence-level and word-level
annotations by 15 annotators. The dataset also provides a human preference
ranking of paraphrases with different types that can be used to fine-tune
models with RLHF and DPO methods. Our results reveal that ChatGPT and a
DPO-trained LLaMA 7B model can generate simple APTs, such as additions and
deletions, but struggle with complex structures (e.g., subordination changes).
This study contributes to understanding which aspects of paraphrasing language
models have already mastered and which remain elusive. In
addition, we show how our curated datasets can be used to develop language
models with specific linguistic capabilities.
♻ ☆ On-Device Collaborative Language Modeling via a Mixture of Generalists and Specialists
On-device LLMs have gained increasing attention for their ability to enhance
privacy and provide a personalized user experience. To facilitate private
learning with scarce data, Federated Learning has become a standard approach.
However, it faces challenges such as computational resource heterogeneity and
data heterogeneity among end users. We propose CoMiGS (Collaborative
learning with a Mixture of Generalists and Specialists), the first approach
to address both challenges. A key
innovation of our method is the bi-level optimization formulation of the
Mixture-of-Experts learning objective, where the router is optimized using a
separate validation set to ensure alignment with the target distribution. We
solve our objective with alternating minimization, for which we provide a
theoretical analysis. Our method shares generalist experts across users while
localizing a varying number of specialist experts, thereby adapting to users'
computational resources and preserving privacy. Through extensive experiments,
we show CoMiGS effectively balances general and personalized knowledge for each
token generation. We demonstrate that CoMiGS remains robust against
overfitting, thanks to the generalists' regularizing effect, while adapting to local
data through specialist expertise. We open source our codebase for
collaborative LLMs.
♻ ☆ Lexical categories of stem-forming roots in Mapudüngun verb forms
After developing a computational system for morphological analysis of the
Mapuche language, and evaluating it with texts from various authors and styles,
it became necessary to verify the linguistic assumptions of the source used as
the basis for implementing this tool.
In the present work, the primary focus is on the lexical category
classification of Mapudüngun roots recognised as verbal in the source
utilised for the development of the morphological analysis system.
The results of this lexical category revision directly benefit the
computational analyser, as they are implemented as soon as they are verified.
Additionally, it is hoped that these results will help clarify some
uncertainties about lexical categories in the Mapuche language.
This work addresses a preliminary task to identify the valency of true verbal
roots, the results of which will be presented in a subsequent work that
complements this article.
comment: 22 pages, 2 large tables, 2 sample tables
♻ ☆ LADDER: Language Driven Slice Discovery and Error Rectification
Error slice discovery is crucial to diagnose and mitigate model errors.
Current clustering or discrete attribute-based slice discovery methods face key
limitations: 1) clustering results in incoherent slices, while assigning
discrete attributes to slices leads to incomplete coverage of error patterns
due to missing or insufficient attributes; 2) these methods lack complex
reasoning, preventing them from fully explaining model biases; and 3) they fail
to integrate domain knowledge, limiting their usage in specialized fields such
as radiology. We propose LADDER (Language-Driven Discovery and Error
Rectification) to address these limitations by: (1) leveraging the flexibility
of natural language to address incompleteness, and (2) employing an LLM's
latent domain knowledge and advanced reasoning to analyze sentences, derive
testable hypotheses directly, identify biased attributes, and form coherent
error slices without clustering. Existing mitigation methods typically address
only the worst-performing group, often amplifying errors in other subgroups. In
contrast, LADDER generates pseudo-attributes from the discovered hypotheses to
mitigate errors across all biases without explicit attribute annotations or
prior knowledge of bias. Rigorous evaluations on 6 datasets spanning natural
and medical images, comparing 200+ classifiers with diverse architectures,
pretraining strategies, and LLMs, show that LADDER consistently outperforms
existing baselines in discovering and mitigating biases.
♻ ☆ Large Language Diffusion Models
Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li
Autoregressive models (ARMs) are widely regarded as the cornerstone of large
language models (LLMs). We challenge this notion by introducing LLaDA, a
diffusion model trained from scratch under the pre-training and supervised
fine-tuning (SFT) paradigm. LLaDA models distributions through a forward data
masking process and a reverse process, parameterized by a vanilla Transformer
to predict masked tokens. By optimizing a likelihood bound, it provides a
principled generative approach for probabilistic inference. Across extensive
benchmarks, LLaDA demonstrates strong scalability, outperforming our
self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong
LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive
instruction-following abilities in case studies such as multi-turn dialogue.
Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal
poem completion task. Our findings establish diffusion models as a viable and
promising alternative to ARMs, challenging the assumption that key LLM
capabilities discussed above are inherently tied to ARMs. Project page and
codes: https://ml-gsai.github.io/LLaDA-demo/.
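The forward data-masking process that LLaDA trains against can be sketched in a few lines. The string-token representation and names are ours for illustration; the actual model operates over token ids with a learned mask token.

```python
import random

MASK = "<mask>"  # illustrative stand-in for the model's mask token

def forward_mask(tokens: list, t: float, rng: random.Random = None) -> list:
    """Independently replace each token with MASK with probability t.
    t interpolates between clean data (t=0) and a fully masked sequence (t=1);
    the reverse process trains a Transformer to predict the masked positions."""
    rng = rng or random.Random(0)
    return [MASK if rng.random() < t else tok for tok in tokens]
```

Optimizing the masked-token prediction loss over random t yields the likelihood bound mentioned in the abstract.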
♻ ☆ Neural-Symbolic Collaborative Distillation: Advancing Small Language Models for Complex Reasoning Tasks AAAI 2025
In this paper, we propose Neural-Symbolic Collaborative Distillation
(NesyCD), a novel knowledge distillation method for learning the complex
reasoning abilities of Large Language Models (LLMs, e.g., >13B). We argue that
complex reasoning tasks are difficult for Small Language Models (SLMs, e.g.,
≤7B), as these tasks demand not only general cognitive abilities but also
specialized knowledge, which is often sparse and difficult for these
neural-based SLMs to effectively capture. Therefore, NesyCD distills the
general capabilities and specialized knowledge in LLMs using different manners.
On the one hand, we distill only general abilities from teacher LLMs into the
student SLMs of parameterized neural networks. On the other hand, for the
specialized abilities and uncommon knowledge of a complex reasoning task, we
employ a symbolic knowledge distillation approach to obtain and store the
specialized knowledge within a symbolic knowledge base (KB). By decoupling
general and specialized capabilities, the proposed NesyCD can achieve superior
performance cost-effectively, utilizing smaller models and blending
parameterized neural networks with symbolic KB. Moreover, the specialized KB
generalizes well and is comprehended and manipulated by humans. Our experiments
show that NesyCD significantly boosts SLMs' complex reasoning performance on
in-domain (BBH, GSM8K) and out-of-domain (AGIEval, ARC) datasets. Notably, our
approach enabled the LLaMA3-8B and Qwen2-7B to surpass GPT-3.5-turbo in
performance and come close to matching LLaMA3-70B, despite the latter having
nine times more parameters. Our code will be available at
https://github.com/Xnhyacinth/NesyCD.
comment: Accepted to AAAI 2025
♻ ☆ From Instance Training to Instruction Learning: Task Adapters Generation from Instructions NeurIPS 2024
Large language models (LLMs) have acquired the ability to solve general tasks
by utilizing instruction finetuning (IFT). However, IFT still relies heavily on
instance training of extensive task data, which greatly limits the adaptability
of LLMs to real-world scenarios where labeled task instances are scarce and
broader task generalization becomes paramount. Contrary to LLMs, humans acquire
skills and complete tasks not merely through repeated practice but also by
understanding and following instructional guidelines. This paper is dedicated
to simulating human learning to address the shortcomings of instance training,
focusing on instruction learning to enhance cross-task generalization. Within
this context, we introduce Task Adapters Generation from Instructions (TAGI),
which automatically constructs the task-specific model in a parameter
generation manner based on the given task instructions without retraining for
unseen tasks. Specifically, we utilize knowledge distillation to enhance the
consistency between TAGI developed through Learning with Instruction and
task-specific models developed through Training with Instance, by aligning the
labels, output logits, and adapter parameters between them. TAGI is endowed
with cross-task generalization capabilities through a two-stage training
process that includes hypernetwork pretraining and finetuning. We evaluate TAGI
on the Super-Natural Instructions and P3 datasets. The experimental results
demonstrate that TAGI can match or even outperform traditional meta-trained
models and other hypernetwork models, while significantly reducing
computational requirements.
comment: accepted to NeurIPS 2024
♻ ☆ Distinguishing Ignorance from Error in LLM Hallucinations
Large language models (LLMs) are susceptible to hallucinations -- factually
incorrect outputs -- leading to a large body of work on detecting and
mitigating such cases. We argue that it is important to distinguish between two
types of hallucinations: ones where the model does not hold the correct answer
in its parameters, which we term HK-, and ones where the model answers
incorrectly despite having the required knowledge, termed HK+. We first find
that HK+ hallucinations are prevalent and occur across models and datasets.
Then, we demonstrate that distinguishing between these two cases is beneficial
for mitigating hallucinations. Importantly, we show that different models
hallucinate on different examples, which motivates constructing model-specific
hallucination datasets for training detectors. Overall, our findings draw
attention to classifying types of hallucinations and provide means to handle
them more effectively. The code is available at
https://github.com/technion-cs-nlp/hallucination-mitigation .
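The HK-/HK+ taxonomy reduces to a simple decision rule, sketched below. This is a toy illustration; in practice the knows-answer signal must itself be estimated, e.g., by probing whether the model answers correctly under other prompts.

```python
def classify_hallucination(knows_answer: bool, answered_correctly: bool) -> str:
    """Label an output under the paper's taxonomy: HK- when the model lacks
    the required knowledge in its parameters, HK+ when it has the knowledge
    but still answers incorrectly."""
    if answered_correctly:
        return "correct"
    return "HK+" if knows_answer else "HK-"
```

The distinction matters because the two error types call for different mitigations: HK- needs external knowledge, while HK+ can potentially be fixed by steering the model toward what it already knows.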
♻ ☆ Comparing Unidirectional, Bidirectional, and Word2vec Models for Discovering Vulnerabilities in Compiled Lifted Code
Ransomware and other forms of malware cause significant financial and
operational damage to organizations by exploiting long-standing and often
difficult-to-detect software vulnerabilities. To detect vulnerabilities such as
buffer overflows in compiled code, this research investigates the application
of unidirectional transformer-based embeddings, specifically GPT-2. Using a
dataset of LLVM functions, we trained a GPT-2 model to generate embeddings,
which were subsequently used to build LSTM neural networks to differentiate
between vulnerable and non-vulnerable code. Our study reveals that embeddings
from the GPT-2 model significantly outperform those from the bidirectional
models BERT and RoBERTa, achieving an accuracy of 92.5% and an F1-score of 89.7%.
LSTM neural networks were developed with both frozen and unfrozen embedding
model layers; the highest performance was achieved when the embedding layers
were unfrozen. Further, in exploring the impact of different optimizers within
this domain, the research finds that the SGD optimizer outperforms Adam.
Overall, these findings reveal
important insights into the potential of unidirectional transformer-based
approaches in enhancing cybersecurity defenses.
comment: 6 pages, 2 figures
♻ ☆ RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
Qiyuan Zhang, Yufei Wang, Tiezheng YU, Yuxin Jiang, Chuhan Wu, Liangyou Li, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, Chen Ma
With significant efforts in recent studies, LLM-as-a-Judge has become a
cost-effective alternative to human evaluation for assessing text generation
quality in a wide range of tasks. However, there still remains a reliability
gap between LLM-as-a-Judge and human evaluation. One important reason is the
lack of guided oracles in the evaluation process. Motivated by the role of
reference pervasively used in classic text evaluation, we introduce RevisEval,
a novel text generation evaluation paradigm via the response-adapted
references. RevisEval is driven by the key observation that an ideal reference
should maintain the necessary relevance to the response to be evaluated.
Specifically, RevisEval leverages the text revision capabilities of large
language models (LLMs) to adaptively revise the response, and then treats the
revised text as the reference (response-adapted reference) for the subsequent
evaluation. Extensive experiments demonstrate that RevisEval outperforms
traditional reference-free and reference-based evaluation paradigms that use
LLM-as-a-Judge across NLG tasks and open-ended instruction-following tasks.
More importantly, our response-adapted references can further boost the
classical text metrics, e.g., BLEU and BERTScore, compared to traditional
references and even rival the LLM-as-a-Judge. A detailed analysis is also
conducted to confirm RevisEval's effectiveness in bias reduction, the impact of
inference cost, and reference relevance.
♻ ☆ VAQUUM: Are Vague Quantifiers Grounded in Visual Data?
Vague quantifiers such as "a few" and "many" are influenced by many
contextual factors, including how many objects are present in a given context.
In this work, we evaluate the extent to which vision-and-language models (VLMs)
are compatible with humans when producing or judging the appropriateness of
vague quantifiers in visual contexts. We release a novel dataset, VAQUUM,
containing 20300 human ratings on quantified statements across a total of 1089
images. Using this dataset, we compare human judgments and VLM predictions
using three different evaluation methods. Our findings show that VLMs, like
humans, are influenced by object counts in vague quantifier use. However, we
find significant inconsistencies across models in different evaluation
settings, suggesting that judging and producing vague quantifiers rely on two
different processes.
comment: Under review, 12 pages for main paper (5 figures), 15 pages including
appendix (2 figures)
♻ ☆ Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge
Large Language Models (LLMs) have become increasingly powerful and
ubiquitous, but their stochastic nature poses challenges to the reliability of
their outputs. While deterministic settings can improve consistency, they do
not guarantee reliability, as a single sample from the model's probability
distribution can still be misleading. Building upon the concept of
LLM-as-a-judge, we introduce a novel framework for rigorously evaluating the
reliability of LLM judgments, leveraging McDonald's omega. We evaluate the
reliability of LLMs when judging the outputs of other LLMs on standard
single-turn and multi-turn benchmarks, simultaneously investigating the impact
of temperature on reliability. By analyzing these results, we demonstrate the
limitations of fixed randomness and the importance of considering multiple
samples, which we show has significant implications for downstream
applications. Our findings highlight the need for a nuanced understanding of
LLM reliability and the potential risks associated with over-reliance on
single-shot evaluations. This work provides a crucial step towards building
more trustworthy and reliable LLM-based systems and applications.
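As a rough illustration, McDonald's omega for a matrix of repeated judgments can be approximated with a one-factor model; this sketch derives loadings from the leading eigenvector of the judge-by-judge correlation matrix and is not the paper's exact estimator:

```python
import numpy as np

def mcdonalds_omega(ratings):
    # ratings: (n_items, k_judges) matrix of repeated LLM judgments.
    # One-factor approximation: loadings taken from the leading eigenvector
    # of the judge-by-judge correlation matrix.
    R = np.corrcoef(ratings, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)                  # ascending order
    loadings = np.sqrt(eigvals[-1]) * np.abs(eigvecs[:, -1])
    common = loadings.sum() ** 2                          # shared variance
    unique = np.clip(1.0 - loadings ** 2, 0.0, None).sum()
    return common / (common + unique)
```

Highly consistent samples of the same latent judgment yield an omega near 1, while independent noise yields a markedly lower value.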
♻ ☆ ToxiLab: How Well Do Open-Source LLMs Generate Synthetic Toxicity Data?
Zheng Hui, Zhaoxiao Guo, Hang Zhao, Juanyong Duan, Lin Ai, Yinheng Li, Julia Hirschberg, Congrui Huang
Effective toxic content detection relies heavily on high-quality and diverse
data, which serve as the foundation for robust content moderation models.
Synthetic data has become a common approach for training models across various
NLP tasks. However, its effectiveness remains uncertain for highly subjective
tasks like hate speech detection, with previous research yielding mixed
results. This study explores the potential of open-source LLMs for harmful data
synthesis, utilizing controlled prompting and supervised fine-tuning techniques
to enhance data quality and diversity. We systematically evaluated six
open-source LLMs on five datasets, assessing their ability to generate diverse,
high-quality harmful data while minimizing hallucination and duplication. Our
results show that Mistral consistently outperforms other open models, and
supervised fine-tuning significantly enhances data reliability and diversity.
We further analyze the trade-offs between prompt-based vs. fine-tuned toxic
data synthesis, discuss real-world deployment challenges, and highlight ethical
considerations. Our findings demonstrate that fine-tuned open source LLMs
provide scalable and cost-effective solutions to augment toxic content
detection datasets, paving the way for more accessible and transparent content
moderation tools.
comment: 14 pages
♻ ☆ LGDE: Local Graph-based Dictionary Expansion
We present Local Graph-based Dictionary Expansion (LGDE), a method for
data-driven discovery of the semantic neighbourhood of words using tools from
manifold learning and network science. At the heart of LGDE lies the creation
of a word similarity graph from the geometry of word embeddings followed by
local community detection based on graph diffusion. The diffusion in the local
graph manifold allows the exploration of the complex nonlinear geometry of word
embeddings to capture word similarities based on paths of semantic association,
over and above direct pairwise similarities. Exploiting such semantic
neighbourhoods enables the expansion of dictionaries of pre-selected keywords,
an important step for tasks in information retrieval, such as database queries
and online data collection. We validate LGDE on two user-generated
English-language corpora and show that LGDE enriches the list of keywords with
improved performance relative to methods based on direct word similarities or
co-occurrences. We further demonstrate our method through a real-world use case
from communication science, where LGDE is evaluated quantitatively on the
expansion of a conspiracy-related dictionary from online data collected and
analysed by domain experts. Our empirical results and expert user assessment
indicate that LGDE expands the seed dictionary with more useful keywords due to
the manifold-learning-based similarity network.
comment: Python code available at:
https://github.com/barahona-research-group/LGDE
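A minimal sketch of the two LGDE ingredients, assuming a small vocabulary with precomputed embeddings, and substituting personalized PageRank for the paper's graph-diffusion-based local community detection:

```python
import numpy as np

def expand_dictionary(embeddings, vocab, seeds, k=2, alpha=0.85, top_n=2):
    # 1) kNN word-similarity graph from the geometry of the embeddings
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    S = X @ X.T
    n = len(vocab)
    A = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(-S[i])[1:k + 1]:          # skip self at rank 0
            A[i, j] = A[j, i] = max(S[i, j], 0.0)
    # 2) diffusion from the seed set (personalized PageRank as a stand-in)
    row = A.sum(axis=1, keepdims=True)
    P = np.divide(A, row, out=np.zeros_like(A), where=row > 0)
    r = np.array([1.0 if w in seeds else 0.0 for w in vocab])
    r /= r.sum()
    s = r.copy()
    for _ in range(100):                              # power iteration
        s = (1 - alpha) * r + alpha * s @ P
    order = np.argsort(-s)
    return [vocab[i] for i in order if vocab[i] not in seeds][:top_n]
```

Multi-hop diffusion lets words reachable via paths of semantic association rank above words that are merely weakly similar to a seed.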
♻ ☆ Reinforced Lifelong Editing for Language Models
Large language models (LLMs) acquire information from pre-training corpora,
but their stored knowledge can become inaccurate or outdated over time. Model
editing addresses this challenge by modifying model parameters without
retraining, and prevalent approaches leverage hypernetworks to generate these
parameter updates. However, they face significant challenges in lifelong
editing due to their incompatibility with LLM parameters that dynamically
change during the editing process. To address this, we observe that
hypernetwork-based lifelong editing aligns with reinforcement learning modeling
and propose RLEdit, an RL-based editing method. By treating editing losses as
rewards and optimizing hypernetwork parameters at the full knowledge sequence
level, we enable it to precisely capture LLM changes and generate appropriate
parameter updates. Our extensive empirical evaluation across several LLMs
demonstrates that RLEdit outperforms existing methods in lifelong editing with
superior effectiveness and efficiency, achieving a 59.24% improvement while
requiring only 2.11% of the time compared to most approaches. Our code is
available at: https://github.com/zhrli324/RLEdit.
♻ ☆ The Responsible Development of Automated Student Feedback with Generative AI
Providing rich, constructive feedback to students is essential for supporting
and enhancing their learning. Recent advancements in Generative Artificial
Intelligence (AI), particularly with large language models (LLMs), present new
opportunities to deliver scalable, repeatable, and instant feedback,
effectively making abundant a resource that has historically been scarce and
costly. From a technical perspective, this approach is now feasible due to
breakthroughs in AI and Natural Language Processing (NLP). While the potential
educational benefits are compelling, implementing these technologies also
introduces a host of ethical considerations that must be thoughtfully
addressed. One of the core advantages of AI systems is their ability to
automate routine and mundane tasks, potentially freeing up human educators for
more nuanced work. However, the ease of automation risks a ``tyranny of the
majority'', where the diverse needs of minority or unique learners are
overlooked, as they may be harder to systematize and less straightforward to
accommodate. Ensuring inclusivity and equity in AI-generated feedback,
therefore, becomes a critical aspect of responsible AI implementation in
education. The process of developing machine learning models that produce
valuable, personalized, and authentic feedback also requires significant input
from human domain experts. Decisions around whose expertise is incorporated,
how it is captured, and when it is applied have profound implications for the
relevance and quality of the resulting feedback. Additionally, the maintenance
and continuous refinement of these models are necessary to adapt feedback to
evolving contextual, theoretical, and student-related factors. Without ongoing
adaptation, feedback risks becoming obsolete or mismatched with the current
needs of diverse student populations [...]
comment: Pre-print of version accepted to EDUCON 2025
♻ ☆ Improving Factuality with Explicit Working Memory
Mingda Chen, Yang Li, Karthik Padthe, Rulin Shao, Alicia Sun, Luke Zettlemoyer, Gargi Ghosh, Wen-tau Yih
Large language models can generate factually inaccurate content, a problem
known as hallucination. Recent works have built upon retrieval-augmented
generation (RAG) to improve factuality through iterative prompting, but these methods
are limited by the traditional RAG design. To address these challenges, we
introduce EWE (Explicit Working Memory), a novel approach that enhances
factuality in long-form text generation by integrating a working memory that
receives real-time feedback from external resources. The memory is refreshed
based on online fact-checking and retrieval feedback, allowing EWE to rectify
false claims during the generation process and ensure more accurate and
reliable outputs. Our experiments demonstrate that EWE outperforms strong
baselines on four fact-seeking long-form generation datasets, increasing the
factuality metric, VeriScore, by 2 to 6 points absolute without sacrificing the
helpfulness of the responses. Further analysis reveals that the design of rules
for memory updates, configurations of memory units, and the quality of the
retrieval datastore are crucial factors for influencing model performance.
♻ ☆ Relation Also Knows: Rethinking the Recall and Editing of Factual Associations in Auto-Regressive Transformer Language Models AAAI25
The storage and recall of factual associations in auto-regressive transformer
language models (LMs) have drawn a great deal of attention, inspiring knowledge
editing by directly modifying the located model weights. Most editing works
achieve knowledge editing under the guidance of existing interpretations of
knowledge recall that mainly focus on subject knowledge. However, these
interpretations are seriously flawed, neglecting relation information and
leading to the over-generalizing problem for editing. In this work, we discover
a novel relation-focused perspective to interpret the knowledge recall of
transformer LMs during inference and apply it on single knowledge editing to
avoid over-generalizing. Experimental results on the dataset supplemented with
a new R-Specificity criterion demonstrate that our editing approach
significantly alleviates over-generalizing while remaining competitive on other
criteria, breaking the domination of subject-focused editing for future
research.
comment: Accepted by AAAI25
♻ ☆ Temporal reasoning for timeline summarisation in social media
This paper explores whether enhancing temporal reasoning capabilities in
Large Language Models (LLMs) can improve the quality of timeline summarisation,
the task of summarising long texts containing sequences of events, such as
social media threads. We first introduce NarrativeReason, a novel dataset
focused on temporal relationships among sequential events within narratives,
distinguishing it from existing temporal reasoning datasets that primarily
address pair-wise event relationships. Our approach then combines temporal
reasoning with timeline summarisation through a knowledge distillation
framework, where we first fine-tune a teacher model on temporal reasoning tasks
and then distill this knowledge into a student model while simultaneously
training it for the task of timeline summarisation. Experimental results
demonstrate that our model achieves superior performance on out-of-domain
mental health-related timeline summarisation tasks, which involve long social
media threads with repetitions of events and a mix of emotions, highlighting
the importance and generalisability of leveraging temporal reasoning to improve
timeline summarisation.
♻ ☆ Flexora: Flexible Low Rank Adaptation for Large Language Models
Large Language Models (LLMs) are driving advancements in artificial
intelligence by increasing the scale of model parameters, which has
significantly enhanced generalization ability and unlocked new capabilities in
practice. However, their performance in specific downstream tasks is usually
hindered by their knowledge boundaries on these tasks. Fine-tuning techniques,
especially the widely used Low-Rank Adaptation (LoRA) method, have therefore
been introduced to expand these boundaries, although LoRA can underperform on
certain tasks owing to potential overfitting. To overcome this overfitting and
improve the performance of LoRA, we propose the flexible low-rank adaptation
(Flexora) method, which automatically and flexibly selects the most important
layers to fine-tune to achieve
the best performance on different downstream tasks. Specifically, Flexora
firstly frames this layer selection problem as a well-defined hyperparameter
optimization (HPO) problem, then addresses it using the unrolled
differentiation (UD) method, and finally selects the most useful layers based
on the optimized hyperparameters. Our extensive experiments on many pretrained
models and natural language tasks show that Flexora is able to consistently
improve over the existing baselines, indicating the effectiveness of our
Flexora in practice. We additionally provide insightful theoretical results and
many ablation studies to deliver a comprehensive understanding of our Flexora.
comment: 39 pages, 15 figures
♻ ☆ Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System
Large Language Model (LLM) based multi-agent systems (MAS) show remarkable
potential in collaborative problem-solving, yet they still face critical
challenges: low communication efficiency, poor scalability, and a lack of
effective parameter-updating optimization methods. We present Optima, a novel
framework that addresses these issues by significantly enhancing both
communication efficiency and task effectiveness in LLM-based MAS through LLM
training. Optima employs an iterative generate, rank, select, and train
paradigm with a reward function balancing task performance, token efficiency,
and communication readability. We explore various RL algorithms, including
Supervised Fine-Tuning, Direct Preference Optimization, and their hybrid
approaches, providing insights into their effectiveness-efficiency trade-offs.
We integrate Monte Carlo Tree Search-inspired techniques for DPO data
generation, treating conversation turns as tree nodes to explore diverse
interaction paths. Evaluated on common multi-agent tasks, including
information-asymmetric question answering and complex reasoning, Optima shows
consistent and substantial improvements over single-agent baselines and vanilla
MAS based on Llama 3 8B, achieving up to 2.8x performance gain with less than
10\% tokens on tasks requiring heavy information exchange. Moreover, Optima's
efficiency gains open new possibilities for leveraging inference-compute more
effectively, leading to improved inference-time scaling laws. By addressing
fundamental challenges in LLM-based MAS, Optima demonstrates the potential for
scalable, efficient, and effective MAS
(https://chenweize1998.github.io/optima-project-page).
comment: Under review
♻ ☆ Blessing of Multilinguality: A Systematic Analysis of Multilingual In-Context Learning
While multilingual large language models generally perform adequately, and
sometimes even rival English performance on high-resource languages (HRLs),
they often significantly underperform on low-resource languages (LRLs). Among
several prompting strategies aiming at bridging the gap, multilingual
in-context learning (ICL) has been particularly effective when demonstrations
in target languages are unavailable. However, a systematic understanding of
when and why it works well is lacking.
In this work, we systematically analyze multilingual ICL, using
demonstrations in HRLs to enhance cross-lingual transfer. We show that
demonstrations in mixed HRLs consistently outperform English-only ones across
the board, particularly for tasks written in LRLs. Surprisingly, our ablation
study shows that the presence of irrelevant non-English sentences in the prompt
yields measurable gains, suggesting the effectiveness of multilingual exposure
itself. Our results highlight the potential of strategically leveraging
multilingual resources to bridge the performance gap for underrepresented
languages.
♻ ☆ The Mirage of Model Editing: Revisiting Evaluation in the Wild
Despite near-perfect results in artificial evaluations, the effectiveness of
model editing in real-world applications remains unexplored. To bridge this
gap, we propose to study model editing in question answering (QA) by
establishing a rigorous evaluation practice to assess the effectiveness of
editing methods in correcting LLMs' errors. It consists of QAEdit, a new
benchmark derived from popular QA datasets, and a standardized evaluation
framework. Our single editing experiments indicate that current editing methods
perform substantially worse than previously reported (38.5% vs. ~96%). Through
module analysis and controlled experiments, we demonstrate that this
performance decline stems from issues in evaluation practices of prior editing
research. One key issue is the inappropriate use of teacher forcing in testing,
which prevents error propagation by feeding ground-truth tokens (inaccessible
in real-world scenarios) as input. Furthermore, we simulate real-world deployment
by sequential editing, revealing that current approaches fail drastically with
only 1000 edits. Our analysis provides a fundamental reexamination of both the
real-world applicability of existing model editing methods and their evaluation
practices, and establishes a rigorous evaluation framework with key insights to
advance reliable and practical model editing research.
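The teacher-forcing issue can be illustrated with a toy next-token model: with ground-truth prefixes an early error does not propagate, while autoregressive decoding derails after the first mistake (a hypothetical illustration, not the QAEdit setup):

```python
# toy next-token "model": a lookup from the previous token to its prediction
model = {"<q>": "rome", "paris": "is", "is": "the", "the": "capital"}
target = ["paris", "is", "the", "capital"]

def teacher_forced_matches(model, target):
    # teacher-forced evaluation: every step sees the GROUND-TRUTH prefix,
    # so an early error does not propagate
    prev, hits = "<q>", 0
    for tok in target:
        hits += model.get(prev) == tok
        prev = tok                      # feed the gold token
    return hits

def autoregressive_matches(model, target):
    # realistic decoding: every step sees the model's OWN previous output,
    # so the first error derails everything after it
    prev, hits = "<q>", 0
    for tok in target:
        pred = model.get(prev)
        hits += pred == tok
        prev = pred                     # feed the prediction
    return hits
```

Here the same model looks 75% correct under teacher forcing yet produces nothing correct when decoding autoregressively, mirroring the reported gap between artificial and realistic evaluation.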
♻ ☆ Towards Homogeneous Lexical Tone Decoding from Heterogeneous Intracranial Recordings ICLR2025
Recent advancements in brain-computer interfaces (BCIs) have enabled the
decoding of lexical tones from intracranial recordings, offering the potential
to restore the communication abilities of speech-impaired tonal language
speakers. However, data heterogeneity induced by both physiological and
instrumental factors poses a significant challenge for unified invasive brain
tone decoding. Traditional subject-specific models, which operate under a
heterogeneous decoding paradigm, fail to capture generalized neural
representations and cannot effectively leverage data across subjects. To
address these limitations, we introduce Homogeneity-Heterogeneity Disentangled
Learning for neural Representations (H2DiLR), a novel framework that
disentangles and learns both the homogeneity and heterogeneity from
intracranial recordings across multiple subjects. To evaluate H2DiLR, we
collected stereoelectroencephalography (sEEG) data from multiple participants
reading Mandarin materials comprising 407 syllables, representing nearly all
Mandarin characters. Extensive experiments demonstrate that H2DiLR, as a
unified decoding paradigm, significantly outperforms the conventional
heterogeneous decoding approach. Furthermore, we empirically confirm that
H2DiLR effectively captures both homogeneity and heterogeneity during neural
representation learning.
comment: ICLR2025 Poster (Preprint V2)
♻ ☆ Sorting the Babble in Babel: Assessing the Performance of Language Detection Algorithms on the OpenAlex Database
This project compares various language classification procedures, each
combining a Python language detection algorithm with a metadata-based corpus
extracted from manually annotated articles sampled from the OpenAlex database.
Following an analysis of precision and recall
performance for each algorithm, corpus, and language as well as of processing
speeds recorded for each algorithm and corpus type, overall procedure
performance at the database level was simulated using probabilistic confusion
matrices for each algorithm, corpus, and language as well as a probabilistic
model of relative article language frequencies for the whole OpenAlex database.
Results show that procedure performance strongly depends on the importance
given to each of the measures implemented: for contexts where precision is
preferred, using the LangID algorithm on the greedy corpus gives the best
results; however, for all cases where recall is considered at least slightly
more important than precision or as soon as processing times are given any kind
of consideration, the procedure combining the FastSpell algorithm and the
Titles corpus outperforms all other alternatives. Given the lack of truly
multilingual, large-scale bibliographic databases, it is hoped that these
results help confirm and foster the unparalleled potential of the OpenAlex
database for cross-linguistic, bibliometric-based research and analysis.
comment: 33 pages, 4 figures
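The database-level simulation can be sketched as follows, assuming a row-normalized confusion matrix P(predicted | true) for a given procedure and a vector of relative language frequencies (illustrative only; corpus and algorithm choices are abstracted away):

```python
import numpy as np

def db_level_precision_recall(confusion, lang_freq):
    # confusion[i, j] = P(predicted language j | true language i), rows sum to 1
    # lang_freq[i]    = P(true language i) over the whole database
    joint = lang_freq[:, None] * confusion    # P(true = i, predicted = j)
    tp = np.diag(joint)                       # correctly labelled mass
    precision = tp / joint.sum(axis=0)        # per predicted language
    recall = tp / joint.sum(axis=1)           # per true language
    return precision, recall
```

Weighting each language's confusion behavior by its database frequency is what lets per-corpus measurements be extrapolated to whole-database performance.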
♻ ☆ Textual Unlearning Gives a False Sense of Unlearning
Language Models (LMs) are prone to ''memorizing'' training data, including
substantial sensitive user information. To mitigate privacy risks and safeguard
the right to be forgotten, machine unlearning has emerged as a promising
approach for enabling LMs to efficiently ''forget'' specific texts. However,
despite the good intentions, is textual unlearning really as effective and
reliable as expected? To address this concern, we first propose the Unlearning
Likelihood Ratio Attack+ (U-LiRA+), a rigorous textual unlearning auditing
method, and find that unlearned texts can still be detected with very high
confidence after unlearning. Further, we conduct an in-depth investigation on
the privacy risks of textual unlearning mechanisms in deployment and present
the Textual Unlearning Leakage Attack (TULA), along with its variants in both
black- and white-box scenarios. We show that textual unlearning mechanisms
could instead reveal more about the unlearned texts, exposing them to
significant membership inference and data reconstruction risks. Our findings
highlight that existing textual unlearning actually gives a false sense of
unlearning, underscoring the need for more robust and secure unlearning
mechanisms.
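The auditing idea, detecting texts that remain anomalously easy for the unlearned model, can be caricatured as a z-score against reference models that never saw the text; this is an illustrative stand-in, not the actual U-LiRA+ statistic:

```python
import statistics

def audit_score(loss_unlearned, losses_reference):
    # z-score of the unlearned model's loss against reference models trained
    # without the text; a large positive score suggests the text is still
    # effectively "remembered" despite unlearning
    mu = statistics.mean(losses_reference)
    sigma = statistics.pstdev(losses_reference) + 1e-8
    return (mu - loss_unlearned) / sigma
```

A supposedly forgotten text with an unusually low loss gets a high score, which is exactly the detectability the abstract reports.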
♻ ☆ RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning
Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Quentin Carbonneaux, Taco Cohen, Gabriel Synnaeve
Large language models (LLMs) deployed as agents solve user-specified tasks
over multiple steps while keeping the required manual engagement to a minimum.
Crucially, such LLMs need to ground their generations in any feedback obtained
to reliably achieve the desired outcomes. We propose an end-to-end
reinforcement learning method for teaching models to leverage execution
feedback in the realm of code synthesis, where state-of-the-art LLMs struggle
to improve code iteratively compared to independent sampling. We benchmark on
competitive programming tasks, where we achieve new state-of-the-art results
with both small (8B parameters) and large (70B) models while reducing the
amount of samples required by an order of magnitude. Our analysis of
inference-time behavior demonstrates that our method produces LLMs that
effectively leverage automatic feedback over multiple steps.
comment: Add repair model ablation, update related work
♻ ☆ FRAME: Boosting LLMs with A Four-Quadrant Multi-Stage Pretraining Strategy
Xuemiao Zhang, Feiyu Duan, Liangyu Xu, Yongwei Zhou, Sirui Wang, Rongxiang Weng, Jingang Wang, Xunliang Cai
Large language models (LLMs) have significantly advanced human language
understanding and generation, with pretraining data quality and organization
being crucial to their performance. Multi-stage pretraining is a promising
approach, but existing methods often lack quantitative criteria for data
partitioning and instead rely on intuitive heuristics. In this paper, we
propose the novel Four-quadRAnt Multi-stage prEtraining strategy (FRAME),
guided by the established principle of organizing the pretraining process into
four stages to achieve four significant loss reductions. This principle is
grounded in two key findings: first, training on high-perplexity (PPL) data
followed by low-PPL data, and second, training on low-PPL-difference (PD) data
followed by high-PD data, each causes the loss to drop significantly twice and
improves performance. By partitioning data into four quadrants and
strategically organizing them, FRAME achieves a remarkable 16.8% average
improvement over random across MMLU and CMMLU for the 3B model, effectively
boosting LLM performance.
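A minimal sketch of the quadrant partitioning, assuming per-document PPL and PD scores are available; the stage ordering shown is one plausible curriculum consistent with the two findings, not necessarily FRAME's exact schedule:

```python
from statistics import median

def four_stage_schedule(docs):
    # docs: list of (doc_id, ppl, pd); split at the medians into four quadrants
    m_ppl = median(p for _, p, _ in docs)
    m_pd = median(d for _, _, d in docs)
    quads = {"hiPPL_loPD": [], "hiPPL_hiPD": [], "loPPL_loPD": [], "loPPL_hiPD": []}
    for doc_id, p, d in docs:
        key = ("hi" if p >= m_ppl else "lo") + "PPL_" \
            + ("lo" if d < m_pd else "hi") + "PD"
        quads[key].append(doc_id)
    # one plausible curriculum: high PPL before low PPL, and within each
    # PPL half, low PD before high PD
    return [quads[q] for q in ("hiPPL_loPD", "hiPPL_hiPD",
                               "loPPL_loPD", "loPPL_hiPD")]
```

The point of the quadrant view is that both orderings become quantitative partitioning criteria rather than intuitive heuristics.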
♻ ☆ DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic, Lightweight Plugin for Large Language Models
Recent advancements in Large Language Models (LLMs) have achieved robust
performance across diverse tasks, but fine-tuning these models for specific
domains remains resource-intensive. Parameter-Efficient Fine-Tuning (PEFT)
methods like Low-Rank Adaptation (LoRA) address this challenge by fine-tuning a
small subset of parameters. However, existing methods for fusing multiple LoRAs
lack dynamic fusion based on contextual inputs and often increase inference
time due to token-level operations. We propose DLP-LoRA, a Dynamic Lightweight
Plugin that employs a mini-MLP module with only 5M parameters to dynamically
fuse multiple LoRAs at the sentence level using top-p sampling strategies. This
approach reduces inference time to less than twice that of single LoRA
inference by leveraging parallel computation. Evaluations across 26
tasks-including multiple-choice questions and question answering-demonstrate
that DLP-LoRA achieves an average accuracy of 92.34% on multiple-choice
datasets and significant improvements in BLEU and ROUGE scores on QA datasets,
outperforming different LLM backbones under composite task settings. DLP-LoRA
effectively balances performance and efficiency, making it a practical solution
for dynamic multi-task adaptation in LLMs. Our code is available at
https://github.com/MeCuping/DLP-LoRA.
comment: Preprint under review, 18 pages, 7 figures
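The sentence-level top-p routing can be sketched as nucleus selection over router probabilities; the mini-MLP router itself and the actual fusion of LoRA weight matrices are omitted, and the function name is hypothetical:

```python
import numpy as np

def top_p_adapters(router_logits, p=0.9):
    # keep the smallest set of adapters whose routing probability mass
    # reaches p (nucleus / top-p), then renormalize their fusion weights
    probs = np.exp(router_logits - np.max(router_logits))   # stable softmax
    probs /= probs.sum()
    order = np.argsort(-probs)
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    keep = order[:cutoff]
    weights = probs[keep] / probs[keep].sum()
    return keep.tolist(), weights.tolist()
```

Selecting once per sentence, rather than per token, is what keeps the routing overhead low enough for the reported sub-2x inference cost.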
♻ ☆ On-Policy Self-Alignment with Fine-grained Knowledge Feedback for Hallucination Mitigation
Xueru Wen, Jie Lou, Xinyu Lu, Ji Yuqiu, Xinyan Guan, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Debing Zhang, Le Sun
Hallucination occurs when large language models exhibit behavior that
deviates from the boundaries of their knowledge during response generation. To
address this critical issue, previous learning-based methods attempt to
finetune models but are limited by off-policy sampling and coarse-grained
feedback. In this paper, we present Reinforcement Learning for Hallucination
(RLFH), an on-policy self-alignment approach that
enables LLMs to actively explore their knowledge boundaries and self-correct
generation behavior through fine-grained feedback signals. RLFH introduces a
self-assessment framework where the policy serves as its own judge. Through
this framework, responses are automatically decomposed into atomic facts and
their truthfulness and informativeness are assessed against external knowledge
sources. The resulting statement-level fine-grained feedback is then
converted into token-level dense reward signals. This enables online
reinforcement learning to achieve precise and timely optimization without human
intervention. Comprehensive evaluations on HotpotQA, SQuADv2, and Biography
benchmarks validate RLFH's effectiveness in hallucination mitigation.
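The conversion from statement-level feedback to token-level dense rewards can be sketched as spreading each statement's score uniformly over its token span (an illustrative scheme; RLFH's exact reward shaping may differ):

```python
def statement_to_token_rewards(num_tokens, statements):
    # statements: (start, end, reward) spans over the token sequence, where
    # each reward comes from a statement-level truthfulness/informativeness
    # judgment against external knowledge
    rewards = [0.0] * num_tokens
    for start, end, r in statements:
        for i in range(start, end):
            rewards[i] = r / (end - start)   # spread the reward over the span
    return rewards
```

Dense per-token rewards are what make the subsequent online RL update precise: credit lands on the tokens of the offending atomic fact rather than on the whole response.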
♻ ☆ Effective Self-Mining of In-Context Examples for Unsupervised Machine Translation with LLMs NAACL 2025
Large Language Models (LLMs) have demonstrated impressive performance on a
wide range of natural language processing (NLP) tasks, primarily through
in-context learning (ICL). In ICL, the LLM is provided with examples that
represent a given task such that it learns to generate answers for test inputs.
However, access to these in-context examples is not guaranteed especially for
low-resource or massively multilingual tasks. In this work, we propose an
unsupervised approach to mine in-context examples for machine translation (MT),
enabling unsupervised MT (UMT) across different languages. Our approach begins
with word-level mining to acquire word translations that are then used to
perform sentence-level mining. As the quality of mined parallel pairs may not
be optimal due to noise or mistakes, we introduce a filtering criterion to
select the optimal in-context examples from a pool of unsupervised parallel
sentences. We evaluate our approach using two multilingual LLMs on 288
directions from the FLORES-200 dataset and analyze the impact of various
linguistic features on performance. Our findings demonstrate the effectiveness
of our unsupervised approach in mining in-context examples for MT, leading to
better or comparable translation performance as translation with regular
in-context samples (extracted from human-annotated data), while also
outperforming the other state-of-the-art UMT methods by an average of $7$ BLEU
points.
comment: Accepted at NAACL 2025
♻ ☆ JOOCI: a Framework for Learning Comprehensive Speech Representations ICLR 2025
Information in speech can be categorized into two groups: Content (what is
being said, such as linguistics) and Other (how it is expressed, such as
information about the speaker and paralinguistic features). Current
self-supervised learning (SSL) methods have been shown to divide the model's
representational depth, or layers, in two, with earlier layers specializing in
Other and later layers in Content-related tasks. This layer-wise division is
inherently sub-optimal, as
neither information type can use all layers to build hierarchical
representations. To address this, we propose JOOCI, a novel speech
representation learning method that does not compromise on the
representational-depth for either information type. JOOCI outperforms WavLM by
26.5%, and other models of similar size (100M parameters), when evaluated on
two speaker recognition and two language tasks from the SUPERB benchmark,
demonstrating its effectiveness in Jointly Optimizing Other and Content
Information (JOOCI).
comment: Submitted to ICLR 2025
♻ ☆ CogSteer: Cognition-Inspired Selective Layer Intervention for Efficiently Steering Large Language Models
Large Language Models (LLMs) achieve remarkable performance through
pretraining on extensive data. This enables efficient adaptation to diverse
downstream tasks. However, the lack of interpretability in their underlying
mechanisms limits the ability to effectively steer LLMs for specific
applications. In this work, we investigate the intrinsic mechanisms of LLMs
from a cognitive perspective using eye movement measures. Specifically, we
analyze the layer-wise correlation between human cognitive indicators and LLM
representations. Building on these insights, we propose a heuristic approach
for selecting the optimal steering layer to modulate LLM semantics. To this
end, we introduce an efficient selective layer intervention based on prominent
parameter-efficient fine-tuning methods, which conventionally adjust either all
layers or only the final layer. Additionally, we present an implicit layer
contrastive intervention during inference to steer LLMs away from toxic
outputs. Extensive experiments on natural language understanding, reasoning,
and generation tasks, conducted on GPT-2, LLaMa2-7B, and Mixtral-7B,
demonstrate the effectiveness and efficiency of our approach. As a
model-agnostic framework, it enhances the interpretability of LLMs while
improving efficiency for safe deployment.
♻ ☆ ALGEN: Few-shot Inversion Attacks on Textual Embeddings using Alignment and Generation
With the growing popularity of Large Language Models (LLMs) and vector
databases, private textual data is increasingly processed and stored as
numerical embeddings. However, recent studies have proven that such embeddings
are vulnerable to inversion attacks, where original text is reconstructed to
reveal sensitive information. Previous research has largely assumed access to
millions of sentences to train attack models, e.g., through data leakage or
nearly unrestricted API access. With our method, a single data point is
sufficient for a partially successful inversion attack. With as few as 1,000
data samples, performance reaches an optimum across a range of black-box
encoders, without training on leaked data. We present ALGEN, a Few-shot
Textual Embedding Inversion Attack using ALignment and GENeration, which
aligns victim embeddings to the attack space and uses a generative model to
reconstruct text. We find that ALGEN attacks can be effectively transferred
across domains and languages, revealing key information. We further examine a
variety of defense mechanisms against ALGEN, and find that none are effective,
highlighting the vulnerabilities posed by inversion attacks. By significantly
lowering the cost of inversion and proving that embedding spaces can be aligned
through one-step optimization, we establish a new textual embedding inversion
paradigm with broader applications for embedding alignment in NLP.
comment: 18 pages, 13 tables, 6 figures
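The claim that embedding spaces can be aligned through one-step optimization can be illustrated with a closed-form least-squares map fitted on a handful of paired embeddings. This is a generic sketch under that reading, not ALGEN's actual procedure; `fit_alignment` and the synthetic data are illustrative:

```python
import numpy as np

def fit_alignment(victim_emb, attack_emb):
    # One-step (closed-form) least-squares map W: victim space -> attack space
    W, *_ = np.linalg.lstsq(victim_emb, attack_emb, rcond=None)
    return W

rng = np.random.default_rng(0)
true_map = rng.normal(size=(8, 8))     # unknown relation between the spaces
victim = rng.normal(size=(32, 8))      # few-shot: only 32 victim embeddings
attack = victim @ true_map             # their counterparts in the attack space

W = fit_alignment(victim, attack)
new_victim = rng.normal(size=(5, 8))   # unseen embeddings align too
print(np.allclose(new_victim @ W, new_victim @ true_map, atol=1e-6))  # → True
```

Once victim embeddings are mapped into the attacker's space, any generator trained in that space can attempt text reconstruction, which is why no training on leaked victim data is needed.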
♻ ☆ MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations
In recent years, self-supervised pre-training methods have gained significant
traction in learning high-level information from raw speech. Among these
methods, HuBERT has demonstrated SOTA performance in automatic speech
recognition (ASR). However, HuBERT's performance lags behind data2vec due to
disparities in pre-training strategies. In this paper, we propose (i) a Swap
method to address the pre-training and inference mismatch observed in HuBERT
and (ii) a Multicluster masked prediction loss for more effective
utilization of the model's capacity. The resulting method, MS-HuBERT, is an
end-to-end self-supervised pre-training method for learning robust speech
representations. It beats vanilla HuBERT on the ASR Librispeech benchmark on
average by a 5% margin when evaluated on different finetuning splits.
Additionally, we demonstrate that the learned embeddings obtained during
pre-training encode essential information for improving performance of content
based tasks such as ASR.
comment: 4 pages, submitted to Interspeech 2024
♻ ☆ Taxonomy and Analysis of Sensitive User Queries in Generative AI Search NAACL2025
Hwiyeol Jo, Taiwoo Park, Hyunwoo Lee, Nayoung Choi, Changbong Kim, Ohjoon Kwon, Donghyeon Jeon, Eui-Hyeon Lee, Kyoungho Shin, Sun Suk Lim, Kyungmi Kim, Jihye Lee, Sun Kim
Although there has been growing interest among industries in integrating
generative LLMs into their services, limited experience and the scarcity of
resources act as barriers to launching and operating large-scale LLM-based
services. In this paper, we share our experiences in developing and operating
generative AI models within a national-scale search engine, with a specific
focus on the sensitivity of user queries. We propose a taxonomy for sensitive
search queries, outline our approaches, and present a comprehensive analysis
report on sensitive queries from actual users. We believe that our experiences
in launching generative AI search systems can help lower the barrier to
building generative LLM-based services.
comment: NAACL2025(Findings)
♻ ☆ StepTool: Enhancing Multi-Step Tool Usage in LLMs through Step-Grained Reinforcement Learning
Despite powerful text generation capabilities, large language models (LLMs)
still need to learn how to utilize external tools to solve complex tasks, a
process known as tool learning. Existing methods primarily rely on supervised
fine-tuning to enhance tool-use capabilities, treating tool learning as a
text-generation task while overlooking the decision-making complexities
inherent in multi-step contexts. In this work, we propose modeling tool
learning as a dynamic decision-making task and introduce StepTool, a novel
step-grained reinforcement learning framework that enhances the multi-step tool
use capabilities of LLMs. StepTool consists of two main components:
Step-grained Reward Shaping, which assigns rewards at each tool interaction
based on the success of tool invocation and its contribution to the task; and
Step-grained Optimization, which uses policy gradient methods to optimize the
model in a multi-step manner. Experimental results demonstrate that StepTool
significantly outperforms existing methods in multi-step, tool-based tasks,
offering a robust solution for tool learning.
comment: Ongoing work
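Step-grained reward shaping as described maps naturally onto per-step returns that weight log-probability gradients in a policy-gradient update. The sketch below is a hypothetical rendering of that idea; the `alpha` weighting and function names are assumptions, not the paper's definitions:

```python
def step_rewards(invocation_ok, contributions, alpha=0.5):
    """Per-step reward: tool-call success plus a weighted task contribution."""
    return [float(ok) + alpha * c for ok, c in zip(invocation_ok, contributions)]

def returns_to_go(rewards, gamma=0.99):
    """Discounted return G_t at each step, used to weight log-prob gradients."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

# three tool calls: success/failure flags and task-contribution scores
rewards = step_rewards([True, False, True], [0.2, 0.0, 1.0], alpha=0.5)
returns = returns_to_go(rewards, gamma=1.0)
print(returns)  # each step's return credits later successes to earlier calls
```

Compared with a single trajectory-level reward, each intermediate tool call receives its own credit, which is the distinction the abstract draws against plain supervised fine-tuning.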
♻ ☆ Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No!
Multimodal Large Language Models (MLLMs) have achieved significant
advancements in tasks like Visual Question Answering (VQA) by leveraging
foundational Large Language Models (LLMs). However, their abilities in specific
areas such as visual temporal understanding, which is crucial for comprehending
real-world dynamics, remain underexplored. To address this, we propose a
challenging evaluation benchmark named TemporalVQA, consisting of two parts: 1)
Temporal Order Understanding and 2) Time-lapse Estimation. The first part
requires MLLMs to determine the sequence of events by analyzing temporally
consecutive video frames. The second part presents image pairs with varying
time differences, framed as multiple-choice questions, asking MLLMs to estimate
the time-lapse between images with options ranging from seconds to years. Our
evaluations of advanced MLLMs, including models like GPT-4o and Gemini-1.5-Pro,
reveal significant challenges: GPT-4o achieved only 49.1% average consistent
accuracy on the temporal order task and 70% on time-lapse estimation, with
open-source models performing even worse. These findings underscore the
limitations of current MLLMs in visual temporal understanding and reasoning,
highlighting the need for further improvement of their temporal capabilities.
Our dataset can be found at
https://huggingface.co/datasets/fazliimam/temporal-vqa.
comment: Our dataset can be found at
https://huggingface.co/datasets/fazliimam/temporal-vqa
♻ ☆ Truth Knows No Language: Evaluating Truthfulness Beyond English
Blanca Calvo Figueras, Eneko Sagarzazu, Julen Etxaniz, Jeremy Barnes, Pablo Gamallo, Iria De Dios Flores, Rodrigo Agerri
We introduce a professionally translated extension of the TruthfulQA
benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and
Spanish. Truthfulness evaluations of large language models (LLMs) have
primarily been conducted in English. However, the ability of LLMs to maintain
truthfulness across languages remains under-explored. Our study evaluates 12
state-of-the-art open LLMs, comparing base and instruction-tuned models using
human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring. Our
findings reveal that, while LLMs perform best in English and worst in Basque
(the lowest-resourced language), overall truthfulness discrepancies across
languages are smaller than anticipated. Furthermore, we show that
LLM-as-a-Judge correlates more closely with human judgments than
multiple-choice metrics, and that informativeness plays a critical role in
truthfulness assessment. Our results also indicate that machine translation
provides a viable approach for extending truthfulness benchmarks to additional
languages, offering a scalable alternative to professional translation.
Finally, we observe that universal knowledge questions are better handled
across languages than context- and time-dependent ones, highlighting the need
for truthfulness evaluations that account for cultural and temporal
variability. Dataset and code are publicly available under open licenses.
comment: 14 pages, 6 figures, 8 tables
♻ ☆ CPRM: A LLM-based Continual Pre-training Framework for Relevance Modeling in Commercial Search NAACL 2025
Kaixin Wu, Yixin Ji, Zeyuan Chen, Qiang Wang, Cunxiang Wang, Hong Liu, Baijun Ji, Jia Xu, Zhongyi Liu, Jinjie Gu, Yuan Zhou, Linjian Mo
Relevance modeling between queries and items stands as a pivotal component in
commercial search engines, directly affecting the user experience. Given the
remarkable achievements of large language models (LLMs) in various natural
language processing (NLP) tasks, LLM-based relevance modeling is gradually
being adopted within industrial search systems. Nevertheless, foundational LLMs
lack domain-specific knowledge and do not fully exploit the potential of
in-context learning. Furthermore, structured item text remains underutilized,
and there is a shortage in the supply of corresponding queries and background
knowledge. We thereby propose CPRM (Continual Pre-training for Relevance
Modeling), a framework designed for the continual pre-training of LLMs to
address these issues. Our CPRM framework includes three modules: 1) employing
both queries and multi-field item text to jointly pre-train for enhancing domain
knowledge, 2) applying in-context pre-training, a novel approach where LLMs are
pre-trained on a sequence of related queries or items, and 3) conducting
reading comprehension on items to produce associated domain knowledge and
background information (e.g., generating summaries and corresponding queries)
to further strengthen LLMs. Results on offline experiments and online A/B
testing demonstrate that our model achieves convincing performance compared to
strong baselines.
comment: NAACL 2025
♻ ☆ CartesianMoE: Boosting Knowledge Sharing among Experts via Cartesian Product Routing in Mixture-of-Experts NAACL2025
Zhenpeng Su, Xing Wu, Zijia Lin, Yizhe Xiong, Minxuan Lv, Guangyuan Ma, Hui Chen, Songlin Hu, Guiguang Ding
Large language models (LLMs) have been attracting much attention from the
community recently, due to their remarkable performance in all kinds of
downstream tasks. According to the well-known scaling law, scaling up a dense
LLM enhances its capabilities, but also significantly increases the
computational complexity. Mixture-of-Experts (MoE) models address that by
allowing the model size to grow without substantially raising training or
inference costs. Yet MoE models face challenges regarding knowledge sharing
among experts, making their performance somewhat sensitive to routing accuracy.
To tackle that, previous works introduced shared experts and combined their
outputs with those of the top $K$ routed experts in an ``addition'' manner. In
this paper, inspired by collective matrix factorization to learn shared
knowledge among data, we propose CartesianMoE, which implements more effective
knowledge sharing among experts in a ``multiplication''-like manner.
Extensive experimental results indicate that CartesianMoE outperforms previous
MoE models for building LLMs, in terms of both perplexity and downstream task
performance. We also find that CartesianMoE achieves better expert routing
robustness.
comment: NAACL2025 Main
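The ``multiplication'' manner of sharing can be pictured as routing over the Cartesian product of two smaller expert sets, so that each effective expert composes one factor from each set. This toy sketch is one reading of the abstract, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_a, n_b = 4, 3, 3
A = [rng.normal(size=(d, d)) for _ in range(n_a)]  # first-factor experts
B = [rng.normal(size=(d, d)) for _ in range(n_b)]  # second-factor experts

def cartesian_expert(x, i, j):
    # effective expert (i, j) in the product A x B: compose the two factors
    return B[j] @ (A[i] @ x)

x = rng.normal(size=d)
outs = [cartesian_expert(x, i, j) for i in range(n_a) for j in range(n_b)]
print(len(outs))  # → 9: 3 x 3 effective experts from only 3 + 3 weight sets
```

Because every effective expert shares one factor with each of its row and column neighbours in the product grid, knowledge is shared structurally rather than by adding a separate shared expert's output.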
♻ ☆ VividMed: Vision Language Model with Versatile Visual Grounding for Medicine
Recent advancements in Vision Language Models (VLMs) have demonstrated
remarkable promise in generating visually grounded responses. However, their
application in the medical domain is hindered by unique challenges. For
instance, most VLMs rely on a single method of visual grounding, whereas
complex medical tasks demand more versatile approaches. Additionally, while
most VLMs process only 2D images, a large portion of medical images are 3D. The
lack of medical data further compounds these obstacles. To address these
challenges, we present VividMed, a vision language model with versatile visual
grounding for medicine. Our model supports generating both semantic
segmentation masks and instance-level bounding boxes, and accommodates various
imaging modalities, including both 2D and 3D data. We design a three-stage
training procedure and an automatic data synthesis pipeline based on open
datasets and models. Besides visual grounding tasks, VividMed also excels in
other common downstream tasks, including Visual Question Answering (VQA) and
report generation. Ablation studies empirically show that the integration of
visual grounding ability leads to improved performance on these tasks. Our code
is publicly available at https://github.com/function2-llx/MMMM.
♻ ☆ The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives
This paper introduces the concept of an education tool that utilizes
Generative Artificial Intelligence (GenAI) to enhance storytelling for
children. The system combines GenAI-driven narrative co-creation,
text-to-speech conversion, and text-to-video generation to produce an engaging
experience for learners. We describe the co-creation process, the adaptation of
narratives into spoken words using text-to-speech models, and the
transformation of these narratives into contextually relevant visuals through
text-to-video technology. Our evaluation covers the linguistics of the
generated stories, the text-to-speech conversion quality, and the accuracy of
the generated visuals.
♻ ☆ Development of Application-Specific Large Language Models to Facilitate Research Ethics Review
Sebastian Porsdam Mann, Joel Seah Jiehao, Stephen R. Latham, Julian Savulescu, Mateo Aboy, Brian D. Earp
Institutional review boards (IRBs) play a crucial role in ensuring the
ethical conduct of human subjects research, but face challenges including
inconsistency, delays, and inefficiencies. We propose the development and
implementation of application-specific large language models (LLMs) to
facilitate IRB review processes. These IRB-specific LLMs would be fine-tuned on
IRB-specific literature and institutional datasets, and equipped with retrieval
capabilities to access up-to-date, context-relevant information. We outline
potential applications, including pre-review screening, preliminary analysis,
consistency checking, and decision support. While addressing concerns about
accuracy, context sensitivity, and human oversight, we acknowledge remaining
challenges such as over-reliance on AI and the need for transparency. By
enhancing the efficiency and quality of ethical review while maintaining human
judgment in critical decisions, IRB-specific LLMs offer a promising tool to
improve research oversight. We call for pilot studies to evaluate the
feasibility and impact of this approach.
comment: 11 pages, 0 figures
♻ ☆ Richer Output for Richer Countries: Uncovering Geographical Disparities in Generated Stories and Travel Recommendations NAACL
While a large body of work inspects language models for biases concerning
gender, race, occupation and religion, biases of geographical nature are
relatively less explored. Some recent studies benchmark the degree to which
large language models encode geospatial knowledge. However, the impact of the
encoded geographical knowledge (or lack thereof) on real-world applications has
not been documented. In this work, we examine large language models for two
common scenarios that require geographical knowledge: (a) travel
recommendations and (b) geo-anchored story generation. Specifically, we study
five popular language models and, across about $100$K travel requests and
$200$K story generations, observe that travel recommendations corresponding
to poorer countries are less unique with fewer location references, and stories
from these regions more often convey emotions of hardship and sadness compared
to those from wealthier nations.
comment: Findings of NAACL (2025)
♻ ☆ With a Grain of SALT: Are LLMs Fair Across Social Dimensions?
This paper presents a systematic analysis of biases in open-source Large
Language Models (LLMs), across gender, religion, and race. Our study evaluates
bias in smaller-scale Llama and Gemma models using the SALT ($\textbf{S}$ocial
$\textbf{A}$ppropriateness in $\textbf{L}$LM-Generated $\textbf{T}$ext)
dataset, which incorporates five distinct bias triggers: General Debate,
Positioned Debate, Career Advice, Problem Solving, and CV Generation. To
quantify bias, we measure win rates in General Debate and the assignment of
negative roles in Positioned Debate. For real-world use cases, such as Career
Advice, Problem Solving, and CV Generation, we anonymize the outputs to remove
explicit demographic identifiers and use DeepSeek-R1 as an automated evaluator.
We also address inherent biases in LLM-based evaluation, including evaluation
bias, positional bias, and length bias, and validate our results through human
evaluations. Our findings reveal consistent polarization across models, with
certain demographic groups receiving systematically favorable or unfavorable
treatment. By introducing SALT, we provide a comprehensive benchmark for bias
analysis and underscore the need for robust bias mitigation strategies in the
development of equitable AI systems.